CONVERGENCE OF SEQUENTIAL MONTE CARLO-BASED
SAMPLING METHODS 

JONATHAN H. HUGGINS AND DANIEL M. ROY 


Abstract. Originally designed for state-space models, Sequential Monte Carlo
(SMC) methods are now routinely applied in the context of general-purpose
Bayesian inference. Traditional analyses of SMC algorithms have focused on
their application to estimating expectations with respect to intractable
distributions such as those arising in Bayesian analysis. However, these
algorithms can also be used to obtain approximate samples from a posterior
distribution of interest. We investigate the asymptotic and non-asymptotic
convergence rates of SMC from this sampling viewpoint. In particular, we study
the expectation of the particle approximation that SMC produces as the number
of particles tends to infinity. This “expected approximation” is equivalent to
the law of a sample drawn from the SMC approximation. We give convergence
rates of the Kullback-Leibler divergence between the target and the expected
approximation. Our results apply to both deterministic and adaptive resampling
schemes. In the adaptive setting, we introduce a novel notion of effective
sample size, the $\infty$-ESS, and show that controlling this quantity ensures
stability of the SMC sampling algorithm. We also introduce an adaptive version
of the conditional SMC proposal, which allows us to prove quantitative bounds
for rates of convergence for adaptive versions of iterated conditional sequential
Monte Carlo Markov chains and associated adaptive particle Gibbs samplers.
1. Introduction


Sequential Monte Carlo (SMC) methods are a widely used class of algorithms for
approximate inference [11, 14, 16, 17]. In the context of Bayesian inference, SMC
produces a particle approximation to the posterior distribution as well as an
unbiased estimate of the marginal likelihood. Traditionally, particle approximations
were intended to be used to estimate expectations such as the posterior
probabilities of events or the expected values of parameters, and the analysis of SMC
methods focused on this operator perspective, i.e., the analysis of how well a
particle approximation approximates the expectation operator induced by the target
distribution.

Increasingly, however, SMC methods are being used to produce approximate
samples, usually in the inner loop of another approximate inference algorithm. A
key example is the class of particle Markov chain Monte Carlo (PMCMC) methods,
which aim to combine the best features of SMC and MCMC approaches by
using SMC as a proposal mechanism for a Metropolis-Hastings (“particle MH”) or
approximate Gibbs (“particle Gibbs”) sampler [1, 15]. Characterizing the efficiency
of PMCMC methods is an active area of investigation [2-5, 18, 19].

When SMC methods are employed for sampling, convergence guarantees from
the operator perspective are not appropriate. In this work, we take up the measure
perspective on SMC, i.e., we characterize how well SMC-based methods can
approximate the target distribution as a measure by assessing convergence in terms
of total variation distance, KL divergence, and other measures of discrepancy
between distributions.

Let $\pi$ be a distribution of interest and let $\pi^N$ denote an $N$-particle SMC
approximation to $\pi$. In this work, the majority of our attention is devoted to
investigating the mean of $\pi^N$, denoted $\bar\pi^N$, which can also be understood as
the (marginal) distribution of a single sample drawn from the particle
approximation $\pi^N$. We use the Kullback-Leibler (KL) divergence from $\pi$ to
$\bar\pi^N$ to measure the performance of the SMC sampler. We begin by giving
convergence rates for sequential importance sampling and sampling importance
resampling (also known as bootstrap filtering) (Section 4). We provide
non-asymptotic bounds under minimal assumptions, which can then be used to obtain
quantitative bounds under a number of different conditions. These quantitative
bounds are obtained using recent results from [2] for bounding the expected value
of the partition function estimator with respect to the conditional SMC kernel. We
also obtain quantitative bounds for the KL divergence from $\pi$ to $\bar\pi^N$ for
adaptive SMC algorithms (Section 5). In the adaptive case our approach uses the
aSMC framework [23], and thus applies to a large class of adaptive algorithms. We
introduce a novel notion of effective sample size, the $\infty$-ESS, the control of
which is sufficient to guarantee convergence of aSMC samplers.

We also propose a version of the conditional SMC kernel with adaptive
resampling, which we call the conditional aSMC kernel (Section 7). Mirroring results
from [2], we give conditions under which iterated conditional aSMC Markov chains
are uniformly ergodic and the associated particle Gibbs sampler with adaptive
resampling (the aPG sampler) is geometrically ergodic. As with the aSMC samplers,
control of the $\infty$-ESS is key to proving these ergodicity results.
2. Preliminaries

After fixing some notation, we define the inference problem and present three
SMC algorithms: sequential importance sampling, sampling importance
resampling, and aSMC. We then introduce the conditional SMC kernel, as well as a
novel adaptive version based on aSMC, which allows us to define adaptive versions
of the iterated conditional SMC process and an associated particle Gibbs algorithm.

2.1. Notation. For a positive integer $N$, let $[N] \triangleq \{1, 2, \dots, N\}$. If
$x_i, \dots, x_j$ are elements of a sequence, write $x_{i,j} \triangleq (x_i, x_{i+1}, \dots, x_j)$.
We use the following conventions: $\sum_{\emptyset} = 0$, $\prod_{\emptyset} = 1$, and $0/0 = 0$.

Let $(S, \mathcal S)$, $(S', \mathcal S')$ be measurable spaces. $K : S \times \mathcal S' \to \mathbb R_+$ is a
kernel if $K(\cdot, B)$ is an $(S, \mathcal S)$-measurable function for all $B \in \mathcal S'$ and
$K(x, \cdot)$ is a measure on $(S', \mathcal S')$ for all $x \in S$. For a measure $\mu$ on
$(S, \mathcal S)$ and kernels $K, K' : S \times \mathcal S \to \mathbb R_+$, let
$\mu K(dy) \triangleq \int \mu(dx) K(x, dy)$ and $KK'(x, dz) \triangleq \int K(x, dy) K'(y, dz)$.
We will often use measures and kernels as operators. For a measurable function
$\phi : S \to \mathbb R$, let $\mu(\phi) \triangleq \int \phi(x)\,\mu(dx)$ and
$K(x)(\phi) \triangleq \int \phi(y)\,K(x, dy)$. We will write $\mu([\phi - \mu(\phi)]^2)$ for
the variance of $\phi$ with respect to $\mu$. For measures $\mu, \nu$ on $(S, \mathcal S)$, we
will write $\mu \ll \nu$ to denote that $\mu$ is absolutely continuous with respect to
$\nu$, in which case we will write $d\mu/d\nu$ for the $\nu$-almost everywhere
($\nu$-a.e.) unique function $f$ satisfying $\mu(A) = \int_A f\,d\nu$ for all
$A \in \mathcal S$. When the choice is clear from context, we may write $\mathcal B(S)$ for
the $\sigma$-algebra of the space $S$.

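Let $\mathcal B_b(S)$ be the set of all measurable bounded real functions on $S$ and let
$\mathcal P(S)$ denote the collection of all probability measures on $(S, \mathcal S)$. For
$\mu, \nu \in \mathcal P(S)$, the total variation distance between $\mu$ and $\nu$ is

\[
  d_{TV}(\mu, \nu) \triangleq \sup_{A \in \mathcal S} |\mu(A) - \nu(A)|. \tag{1}
\]

If $\mu \ll \nu$, then the KL divergence from $\mu$ to $\nu$ is

\[
  \mathrm{KL}(\mu \| \nu) \triangleq \mu(\log d\mu/d\nu) \tag{2}
\]

and the $\chi^2$-divergence from $\mu$ to $\nu$ is

\[
  d_{\chi^2}(\mu, \nu) \triangleq \nu([d\mu/d\nu - 1]^2). \tag{3}
\]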

Finally, we note that, when there is little at stake, we will ignore
measure-theoretic niceties such as the distinction between equality and a.e.-equality.
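For intuition, the three discrepancies in Eqs. (1)-(3) can be evaluated directly on a finite state space. The following is a minimal sketch, assuming probability vectors stored as NumPy arrays with $\nu$ strictly positive; the function names and the example vectors are illustrative and not part of the paper.

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance, Eq. (1). On a finite space the supremum over
    events equals half the L1 distance between the probability vectors."""
    return 0.5 * float(np.abs(mu - nu).sum())

def kl_divergence(mu, nu):
    """KL divergence from mu to nu, Eq. (2); assumes nu > 0 wherever mu > 0."""
    support = mu > 0
    return float(np.sum(mu[support] * np.log(mu[support] / nu[support])))

def chi2_divergence(mu, nu):
    """Chi-squared divergence from mu to nu, Eq. (3); assumes nu > 0."""
    return float(np.sum(nu * (mu / nu - 1.0) ** 2))

# Illustrative example vectors (assumptions, not taken from the paper).
mu = np.array([0.5, 0.5])
nu = np.array([0.25, 0.75])
print(tv_distance(mu, nu), kl_divergence(mu, nu), chi2_divergence(mu, nu))
```

On such examples one can check numerically the two classical relations used later in the text: Pinsker's inequality, $d_{TV} \le \sqrt{\mathrm{KL}/2}$, and $\mathrm{KL} \le \log(1 + d_{\chi^2})$.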

2.2. Problem Statement. We follow a similar setup and notation to [9]. Let
$(X_t)_{t \ge 1}$ be an inhomogeneous Markov chain on $(E, \mathcal E)$ with transition
kernels $(M_t)_{t \ge 2}$ and initial distribution $M_1$. We write $M_1(x, \cdot) = M_1(\cdot)$
when convenient. For all $t \ge 1$ and $x_{t-1} \in E$, assume that $M_t(x_{t-1}, \cdot)$ has
a density with respect to some $\sigma$-finite dominating measure (which we denote by
$dx$). We abuse notation and write the density of $M_t(x_{t-1}, \cdot)$ as
$M_t(x_{t-1}, x_t)$. Denote expectations and variances with respect to the Markov
chain by $\mathbb E[\cdot]$ and $\mathbb V[\cdot]$, respectively. Let $g_t : E \to \mathbb R_+$, for
$t \ge 1$, be a sequence of $\mathcal E$-measurable potential functions on $E$.^1 Denote a
product of potentials by $g_{s,t}(x_{s,t}) \triangleq \prod_{r=s}^{t} g_r(x_r)$ and let $g_0 = 1$.

^1 In the state-space setting the potential $g_t$ would be the likelihood from the
observation at time $t$: $g_t(x_t) = p_t(y_t \mid x_t)$.
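The unnormalized predictive and updated measures are defined, respectively, to
be

\[
  \gamma'_{1,t}(\phi_{1,t}) \triangleq \mathbb E[\phi_{1,t}(X_{1,t})\, g_{1,t-1}(X_{1,t-1})] \tag{4}
\]

and

\[
  \gamma_{1,t}(\phi_{1,t}) \triangleq \mathbb E[\phi_{1,t}(X_{1,t})\, g_{1,t}(X_{1,t})] \tag{5}
\]

with corresponding marginal versions

\[
  \gamma'_t(\phi_t) \triangleq \mathbb E[\phi_t(X_t)\, g_{1,t-1}(X_{1,t-1})] = Q_{0,t}(\phi_t) \tag{6}
\]

and

\[
  \gamma_t(\phi_t) \triangleq \mathbb E[\phi_t(X_t)\, g_{1,t}(X_{1,t})]. \tag{7}
\]

Our goal is to approximate, in a sense to be discussed later, the normalized
predictive and updated measures, and their marginal versions, which are defined to
be

\[
  \pi_{1,t}(\phi_{1,t}) \triangleq \frac{\gamma_{1,t}(\phi_{1,t})}{Z_t}, \qquad
  \eta_{1,t}(\phi_{1,t}) \triangleq \frac{\gamma'_{1,t}(\phi_{1,t})}{Z'_t}, \tag{8}
\]

\[
  \pi_t(\phi_t) \triangleq \frac{\gamma_t(\phi_t)}{Z_t}, \quad\text{and}\quad
  \eta_t(\phi_t) \triangleq \frac{\gamma'_t(\phi_t)}{Z'_t}, \tag{9}
\]

where $Z_t \triangleq \gamma_t(1)$ and $Z'_t \triangleq \gamma'_t(1)$ are normalization constants.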

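For $t \ge 1$, let $Q_t(x_{t-1}, dx_t) \triangleq g_{t-1}(x_{t-1})\, M_t(x_{t-1}, dx_t)$ and for
$0 \le s < t$, let

\[
  Q_{s,t} \triangleq Q_{s+1} Q_{s+2} \cdots Q_t, \tag{10}
\]

so $Q_{t,t+1} = Q_{t+1}$. By convention $Q_{t,t}(x_t, dy_t) = \delta_{x_t}(dy_t)$, and
$Q_{0,t}(dx_t)$ is a measure, not a probability kernel. Notice that for $s \in [t]$,
$x_s \in E$, and $\phi_t : E \to \mathbb R$,

\[
  Q_{s,t}(x_s)(\phi_t) = \mathbb E[\phi_t(X_t)\, g_{s,t-1}(X_{s,t-1}) \mid X_s = x_s] \tag{11}
\]

and $Q_{0,t}(\phi_t) = M_1 Q_{1,t}(\phi_t)$. Generalizing these identities, we will abuse
notation and write, for $s \in [t]$, $x_s \in E$, and $\phi_{s,t} : E^{t-s+1} \to \mathbb R$,

\[
  Q_{s,t}(x_s)(\phi_{s,t}) = \mathbb E[\phi_{s,t}(X_{s,t})\, g_{s,t-1}(X_{s,t-1}) \mid X_s = x_s] \tag{12}
\]

and $Q_{0,t}(\phi_{1,t}) = M_1 Q_{1,t}(\phi_{1,t})$. Let $G_{s,t}(y) \triangleq Q_{s,t}(y)(1)$ for
$s \in [t-1]$ and $G_{0,t} \triangleq Q_{0,t}(1)$.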

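Observe that $\gamma'_{1,t}(\phi_{1,t}) = Q_{0,t}(\phi_{1,t})$, $\gamma'_t(\phi_t) = Q_{0,t}(\phi_t)$, and
$Z'_t = Z_{t-1} = G_{0,t}$, so the one-step predictive distribution and its marginal
version can be written as

\[
  \eta_{1,t}(\phi_{1,t}) = \frac{Q_{0,t}(\phi_{1,t})}{G_{0,t}}
  \quad\text{and}\quad
  \eta_t(\phi_t) = \frac{Q_{0,t}(\phi_t)}{G_{0,t}}. \tag{13}
\]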
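On a finite state space the operators $Q_t$ reduce to matrix-vector products, which makes the identity $Z'_t = Z_{t-1} = G_{0,t}$ easy to check numerically. The sketch below assumes a time-homogeneous transition matrix `P` and a single potential vector `g` shared across time steps; these simplifications, the function name, and the numbers are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def unnormalized_marginals(m1, P, g, t):
    """Return the unnormalized predictive and updated marginals at time t
    as vectors on a finite state space.

    Assumptions (illustrative): P is a row-stochastic transition matrix used
    at every step, and g is the potential vector used at every step.
    """
    gamma_pred = np.asarray(m1, dtype=float).copy()   # Q_{0,1} = M_1
    for _ in range(2, t + 1):
        # One application of Q_s(x, dx') = g_{s-1}(x) M_s(x, dx').
        gamma_pred = (gamma_pred * g) @ P
    gamma_upd = gamma_pred * g                        # reweight by g_t
    return gamma_pred, gamma_upd

m1 = np.array([0.6, 0.4])
P = np.array([[0.9, 0.1], [0.3, 0.7]])
g = np.array([0.5, 2.0])

pred3, upd3 = unnormalized_marginals(m1, P, g, 3)
_, upd2 = unnormalized_marginals(m1, P, g, 2)
z_pred3, z2 = pred3.sum(), upd2.sum()   # predictive constant at t = 3, and Z_2
eta3 = pred3 / z_pred3                  # normalized predictive marginal at t = 3
```

Because the rows of `P` sum to one, the total mass of the predictive measure at time 3 equals that of the updated measure at time 2, which is the finite-state form of $Z'_t = Z_{t-1}$.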


2.3. SMC Algorithms. Sequential Monte Carlo algorithms construct approximations
to the distributions $\pi_{1,t}$, $\pi_t$, $\eta_{1,t}$, and $\eta_t$, for each
$t = 1, 2, \dots$ in turn, and use earlier approximations to produce later ones.
Algorithm 1 Sequential Importance Sampling

for $n = 1, \dots, N$ do
  Sample $X_1^n$ from $M_1$
  Set $X_{1,1}^n \leftarrow X_1^n$
end for
for $t = 2, 3, \dots$ do
  for $n = 1, \dots, N$ do
    Sample $X_t^n$ from $M_t(X_{t-1}^n, \cdot)$
    Set $X_{1,t}^n \leftarrow (X_{1,t-1}^n, X_t^n)$
  end for
end for
Form the SIS joint and marginal estimators $\pi_{1,t}^{S,N}$, $\pi_t^{S,N}$, $\eta_{1,t}^{S,N}$, $\eta_t^{S,N}$.


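2.3.1. Sequential Importance Sampling. The SIS algorithm, given as Algorithm 1,
operates by propagating a collection of $N$ particles $X_{1,t} = \{X_{1,t}^n\}_{n=1}^N$ with
corresponding nonnegative weights $W_t = \{W_t^n\}_{n=1}^N$, where

\[
  W_t^n \triangleq \frac{g_{1,t}(X_{1,t}^n)}{\sum_{k=1}^N g_{1,t}(X_{1,t}^k)}. \tag{14}
\]

The updated distributions $\pi_{1,t}$ and $\pi_t$ are then approximated by

\[
  \pi_{1,t}^{S,N} \triangleq \sum_{n=1}^N W_t^n \delta_{X_{1,t}^n}
  \quad\text{and}\quad
  \pi_t^{S,N} \triangleq \sum_{n=1}^N W_t^n \delta_{X_t^n} \tag{15}
\]

and the predictive distributions $\eta_{1,t}$ and $\eta_t$ are approximated by

\[
  \eta_{1,t}^{S,N} \triangleq \sum_{n=1}^N W_{t-1}^n \delta_{X_{1,t}^n}
  \quad\text{and}\quad
  \eta_t^{S,N} \triangleq \sum_{n=1}^N W_{t-1}^n \delta_{X_t^n}. \tag{16}
\]

The estimators of the normalization constants $Z_t$ and $Z'_t$ are
$\hat Z_t \triangleq \frac{1}{N} \sum_{k=1}^N g_{1,t}(X_{1,t}^k)$ and
$\hat Z'_t \triangleq \frac{1}{N} \sum_{k=1}^N g_{1,t-1}(X_{1,t-1}^k)$. Expectations with respect
to the law of the SIS algorithm are written as $\mathbb E^{S,N}[\cdot]$.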
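The steps of Algorithm 1 and the weights in Eq. (14) can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the callback signatures (`sample_m1`, `sample_mt`, `potential`) and the Gaussian toy model are assumptions introduced here for the example.

```python
import numpy as np

def sis(sample_m1, sample_mt, potential, t_max, n_particles, rng):
    """Sequential importance sampling (Algorithm 1) with weights as in Eq. (14).

    Hypothetical callbacks (not from the paper):
      sample_m1(n, rng)        -> array of n draws from M_1
      sample_mt(t, x_prev, rng) -> one draw from M_t(x, .) per entry of x_prev
      potential(t, x)          -> g_t evaluated at each entry of x
    """
    x = sample_m1(n_particles, rng)
    paths = [x]
    w = potential(1, x)                  # g_{1,1}(X_1^n)
    for t in range(2, t_max + 1):
        x = sample_mt(t, x, rng)         # propagate each particle independently
        paths.append(x)
        w = w * potential(t, x)          # g_{1,t} is a running product of potentials
    z_hat = w.mean()                     # estimator of Z_t
    weights = w / w.sum()                # normalized weights W_t^n
    return np.stack(paths), weights, z_hat

# Hypothetical toy model: Gaussian random walk with a Gaussian-shaped potential.
rng = np.random.default_rng(0)
paths, weights, z_hat = sis(
    sample_m1=lambda n, r: r.normal(size=n),
    sample_mt=lambda t, x, r: x + r.normal(size=x.shape),
    potential=lambda t, x: np.exp(-0.5 * x**2),
    t_max=5, n_particles=100, rng=rng,
)
```

Row `s - 1` of `paths` holds the time-`s` coordinates of all particles, so column `n` is the trajectory $X_{1,t}^n$ and `weights[n]` is its weight $W_t^n$.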


Remark 2.1. Since all four measures of interest take very similar forms, going
forward we will only explicitly define quantities related to them, such as
estimators, for the marginal measures $\pi_t$ and $\eta_t$.


2.3.2. Sampling Importance Resampling. A more practical algorithm, which does
not suffer from the weight degeneracy problem of SIS, is sampling importance
resampling (SIR) [14, 16]. The SIR algorithm, given as Algorithm 2, is identical to SIS
except for a resampling step performed after each iteration. Let
$W_t = \{W_t^n\}_{n=1}^N$ denote weights for the particles $X_{1,t} = \{X_{1,t}^n\}_{n=1}^N$ at
time $t$, where

\[
  W_t^n \triangleq \frac{g_t(X_t^n)}{\sum_{k=1}^N g_t(X_t^k)}.
\]

The SIR marginal estimators are

\[
  \pi_t^{R,N} \triangleq \sum_{n=1}^N W_t^n \delta_{X_t^n}
  \quad\text{and}\quad
  \eta_t^{R,N} \triangleq \frac{1}{N} \sum_{n=1}^N \delta_{X_t^n} \tag{17}
\]
Algorithm 2 Sampling Importance Resampling

for $n = 1, \dots, N$ do
  Sample $X_1^n$ from $M_1$
  Set $X_{1,1}^n \leftarrow X_1^n$
end for
for $t = 2, 3, \dots$ do
  Sample $A_{t-1}$ from $r(\cdot \mid W_{t-1})$
  for $n = 1, \dots, N$ do
    Sample $X_t^n$ from $M_t(X_{t-1}^{A_{t-1}^n}, \cdot)$
    Set $X_{1,t}^n \leftarrow (X_{1,t-1}^{A_{t-1}^n}, X_t^n)$
  end for
end for
Form the SIR joint and marginal estimators $\pi_{1,t}^{R,N}$, $\pi_t^{R,N}$, $\eta_{1,t}^{R,N}$, $\eta_t^{R,N}$.


and the estimators of the normalization constants $Z_t$ and $Z'_t$ are
$\hat Z_t \triangleq \prod_{s=1}^t \frac{1}{N} \sum_{n=1}^N g_s(X_s^n)$ and $\hat Z'_t \triangleq \hat Z_{t-1}$.
Expectations with respect to the law of the SIR algorithm are written as
$\mathbb E^{R,N}[\cdot]$.

The SIR algorithm makes use of a resampling kernel with density $r(a \mid w)$, where
$w \in \Delta_N \triangleq \{v \in [0,1]^N \mid \sum_{n=1}^N v_n = 1\}$ and $a \in [N]^N$. A minimal
assumption required for the algorithm to be correct is that
$r(a_n = k \mid w) = w_k$. However, we will focus on the simplest case where
$r(a \mid w) = r_{\mathrm{multi}}(a \mid w) \triangleq \prod_{n=1}^N w_{a_n}$, i.e.,
$a_n \mid w \sim \mathrm{Multi}(w)$ independently for $n = 1, \dots, N$.
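Multinomial resampling and the resulting SIR recursion can be sketched as follows. This is an illustration only: it tracks marginal particles rather than full trajectories, it uses 0-based indices, the callback signatures are hypothetical, and the normalization-constant estimate is taken to be the product of per-step average potentials (an assumption about the estimator's form).

```python
import numpy as np

def multinomial_resample(w, rng):
    """Draw ancestor indices a_1, ..., a_N i.i.d. from Multi(w), so that
    P(a_n = k) = w_k, which is the minimal correctness condition in the text."""
    return rng.choice(len(w), size=len(w), p=w)

def sir(sample_m1, sample_mt, potential, t_max, n_particles, rng):
    """Sampling importance resampling (Algorithm 2), marginal particles only.

    Hypothetical callbacks: sample_m1(n, rng), sample_mt(t, x_prev, rng),
    and potential(t, x), all acting elementwise on arrays of particles.
    """
    x = sample_m1(n_particles, rng)
    z_hat = 1.0
    for t in range(2, t_max + 1):
        g = potential(t - 1, x)
        z_hat *= g.mean()                       # factor (1/N) sum_n g_{t-1}(X_{t-1}^n)
        ancestors = multinomial_resample(g / g.sum(), rng)
        x = sample_mt(t, x[ancestors], rng)     # propagate the resampled particles
    g = potential(t_max, x)
    z_hat *= g.mean()
    return x, g / g.sum(), z_hat                # particles, weights W_t, estimate of Z_t
```

Other resampling schemes satisfying $r(a_n = k \mid w) = w_k$ could be substituted for `multinomial_resample` without changing the rest of the loop.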


2.3.3. Adaptive Resampling. The SIR algorithm uses a deterministic resampling
scheme. However, it is common for practitioners to use adaptive resampling
algorithms to choose when to resample based on the realized particle weights. The most
popular adaptive scheme relies on the effective sample size (ESS) criterion to
determine when a resampling step is performed [12, 20, 21]. Specifically, resampling
occurs when the ESS is below some fixed threshold (e.g., $N/2$), where the ESS for
unnormalized weights $w = (w_1, \dots, w_N) \in \mathbb R_+^N$ is

\[
  \mathcal E^2(w) \triangleq \frac{\left(\sum_{n=1}^N w_n\right)^2}{\sum_{n=1}^N w_n^2} \tag{18}
\]

and takes on values between 1 and $N$. The nomenclature arises from interpreting
$\mathcal E^2(w)$ as the effective number of particles the algorithm is using if the
particles have weights $w$.
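As a quick illustration of Eq. (18), the ESS equals $N$ for uniform weights and 1 when a single particle carries all of the weight. The helper names and the default threshold of one half below are illustrative assumptions.

```python
import numpy as np

def ess(w):
    """Effective sample size of Eq. (18) for unnormalized nonnegative weights."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def should_resample(w, threshold_fraction=0.5):
    """ESS criterion: resample when the ESS drops below a fraction of N."""
    return ess(w) < threshold_fraction * len(w)

print(ess(np.ones(10)), ess([0.0, 0.0, 1.0]))
```

The ESS is invariant to rescaling of the weights, so it can be applied to unnormalized or normalized weights interchangeably.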

Whiteley, Lee, and Heine [23] recently introduced a very general adaptive
algorithm they call aSMC, which includes SIS, SIR, and numerous other SMC variants
as special cases. The aSMC algorithm, which is given as Algorithm 3, provides a
flexible resampling mechanism: at each time $t$, a stochastic matrix $\alpha_{t-1}$ is
chosen from a set $\mathbb A_N$ of $N \times N$ matrices. We denote the value in the $n$-th row
and $k$-th column of $\alpha_{t-1}$ by $\alpha_{t-1}^{nk}$. The aSMC marginal estimators are

\[
  \pi_t^{a,N} \triangleq \sum_{n=1}^N \frac{W_t^n g_t(X_t^n)}{\sum_{k=1}^N W_t^k g_t(X_t^k)}\, \delta_{X_t^n}
  \quad\text{and}\quad
  \eta_t^{a,N} \triangleq \sum_{n=1}^N \frac{W_t^n}{\sum_{k=1}^N W_t^k}\, \delta_{X_t^n} \tag{19}
\]
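Algorithm 3 aSMC

for $n = 1, \dots, N$ do
  Sample $X_1^n$ from $M_1$
  Set $X_{1,1}^n \leftarrow X_1^n$
  Set $W_1^n \leftarrow 1$
end for
for $t = 2, 3, \dots$ do
  Select $\alpha_{t-1}$ from $\mathbb A_N$ according to some functional of $X_{1,t-1}$
  for $n = 1, \dots, N$ do
    Set $W_t^n \leftarrow \sum_{k=1}^N \alpha_{t-1}^{nk} W_{t-1}^k g_{t-1}(X_{t-1}^k)$
    Sample $A_{t-1}^n$ from $\mathrm{Multi}\big( (\alpha_{t-1}^{nk} W_{t-1}^k g_{t-1}(X_{t-1}^k) / W_t^n)_{k=1}^N \big)$
    Sample $X_t^n$ from $M_t(X_{t-1}^{A_{t-1}^n}, \cdot)$
    Set $X_{1,t}^n \leftarrow (X_{1,t-1}^{A_{t-1}^n}, X_t^n)$
  end for
end for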


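and the estimators of the normalization constants $Z_t$ and $Z'_t$ are

\[
  \hat Z_t \triangleq \frac{1}{N} \sum_{k=1}^N W_t^k g_t(X_t^k)
  \quad\text{and}\quad
  \hat Z'_t \triangleq \frac{1}{N} \sum_{k=1}^N W_t^k. \tag{20}
\]

Expectations with respect to the law of the aSMC algorithm are written as
$\mathbb E^{a,N}[\cdot]$.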

Not only does the aSMC formulation aid in analyzing adaptive resampling
strategies, it provides a useful framework for devising novel adaptive schemes with
attractive computational properties, such as admitting parallelization even on
resampling steps. SIS, SIR, and the standard adaptive algorithm can all be obtained
as special cases of aSMC as follows. SIS is recovered when $\alpha_{t-1} = I_N$, the
$N \times N$ identity matrix, while SIR is recovered when $\alpha_{t-1} = \mathbf 1_{1/N}$, the
$N \times N$ matrix with all entries equal to $1/N$. The so-called adaptive particle
filter (APF) algorithm is obtained by setting $\alpha_{t-1}$ to $\mathbf 1_{1/N}$ if
$\mathcal E^2(W_{t-1}) < \zeta N$ and to $I_N$ otherwise, where $\zeta \in (0, 1]$ is
fixed.
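The three special cases can be made concrete with a small sketch of one aSMC weight update. Everything here is illustrative: the function names are hypothetical, and applying the ESS test to the previous weight vector `w_prev` is an assumption about which weights the APF criterion inspects.

```python
import numpy as np

def ess(w):
    return w.sum() ** 2 / np.sum(w ** 2)

def alpha_sis(n):
    """Identity matrix: every particle keeps its own ancestor (no interaction)."""
    return np.eye(n)

def alpha_sir(n):
    """All entries 1/N: every particle resamples from the whole population."""
    return np.full((n, n), 1.0 / n)

def alpha_apf(w_prev, zeta):
    """Adaptive choice: full resampling only when the ESS falls below zeta * N."""
    n = len(w_prev)
    return alpha_sir(n) if ess(w_prev) < zeta * n else alpha_sis(n)

def asmc_step(alpha, w_prev, g_prev):
    """Return the new weights and the per-particle ancestor distributions.

    u[n, k] = alpha[n, k] * W_{t-1}^k * g_{t-1}(X_{t-1}^k); row sums give W_t^n,
    and each normalized row is the law of the ancestor index of particle n.
    Assumes every row sum is strictly positive.
    """
    u = alpha * (w_prev * g_prev)[None, :]
    w_new = u.sum(axis=1)
    return w_new, u / w_new[:, None]

w_prev = np.ones(3)
g_prev = np.array([0.5, 1.0, 1.5])
w_sis, p_sis = asmc_step(alpha_sis(3), w_prev, g_prev)
w_sir, p_sir = asmc_step(alpha_sir(3), w_prev, g_prev)
```

With the identity matrix the weights simply accumulate the potentials and each particle is its own ancestor; with the constant matrix all new weights are equal and every ancestor is drawn from the normalized weighted potentials, as in SIR.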

Remark 2.2. We will write $\pi$ and $\pi^N$ to denote a generic updated/predictive
marginal/full distribution and its associated SIS/SIR/aSMC estimator. Expectations
with respect to the estimator’s law are written $\mathbb E^N[\cdot]$.


2.4. Conditional SMC Processes. We will need to consider variants of SIR and
aSMC in which one or more particle paths is fixed ahead of time. In the case
of SIR, these algorithms are known as conditional SMC algorithms [1, 2]. In the
case of aSMC we will refer to the algorithms as conditional aSMC algorithms.
Throughout the section, fix $t \ge 1$, $i \ge 1$, and $N \ge i$.

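The $i$-times conditional SMC (c$^i$SMC) process is defined on the space
$(E^N \times [N]^N)^{t-1} \times E^N \times [N]$, and is essentially SIR in which the first $i$
particle trajectories $y_{1,t}^1, \dots, y_{1,t}^i \in E^t$ are a priori fixed, with lineages
$k_{1,t}^1, \dots, k_{1,t}^i \in [N]^t$. For $x_{1,t} \in (E^N)^t$, $a_{1,t-1} \in ([N]^N)^{t-1}$, and
$a_t \in [N]$, the law of the c$^i$SMC process is given by

\[
  \mathbb P^{R,N}_{y_{1,t}^{1:i}, k_{1,t}^{1:i}}(X_1 \in dx_1)
  \triangleq \prod_{l=1}^{i} \delta_{y_1^l}(dx_1^{k_1^l}) \prod_{n \notin \{k_1^1, \dots, k_1^i\}} M_1(dx_1^n), \tag{21}
\]

for $s = 2, \dots, t$,

\[
  \mathbb P^{R,N}_{y_{1,t}^{1:i}, k_{1,t}^{1:i}}(X_s \in dx_s, A_{s-1} = a_{s-1} \mid X_{s-1} = x_{s-1})
  \triangleq \prod_{l=1}^{i} \mathbb 1(a_{s-1}^{k_s^l} = k_{s-1}^l)\, \delta_{y_s^l}(dx_s^{k_s^l})
  \prod_{n \notin \{k_s^1, \dots, k_s^i\}} \frac{g_{s-1}(x_{s-1}^{a_{s-1}^n})}{\sum_{j=1}^N g_{s-1}(x_{s-1}^j)}\, M_s(x_{s-1}^{a_{s-1}^n}, dx_s^n), \tag{22}
\]

and

\[
  \mathbb P^{R,N}_{y_{1,t}^{1:i}, k_{1,t}^{1:i}}(A_t = a_t \mid X_t = x_t)
  \triangleq \frac{g_t(x_t^{a_t})}{\sum_{n=1}^N g_t(x_t^n)}. \tag{23}
\]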

If $i = 1$, we write cSMC instead of c$^1$SMC. Also, note that the c$^i$SMC process is
not the same as conditioning on $i$ trajectories and lineages of the SIR algorithm.

The $i$-times conditional aSMC (c$^i$aSMC) process is defined on the space
$(E^N \times [N]^N \times [N]^i)^{t-1} \times E^N \times [N] \times [N]^i$, and is essentially aSMC in which
the first $i$ particle trajectories, but not their lineages, are a priori fixed. If
$f \in [N]^i$ are indices of the first $i$ particles, let
$V(f) \triangleq \prod_{j \ne j'} \mathbb 1(f^j \ne f^{j'})$ be the function that indicates whether
the indices are distinct. Given trajectories $y_{1,t}^1, \dots, y_{1,t}^i \in E^t$, for
$x_{1,t} \in (E^N)^t$, $f_{1,t} \in ([N]^i)^t$, $a_{1,t-1} \in ([N]^N)^{t-1}$, and $a_t \in [N]$, the
law of the c$^i$aSMC process is given by

\[
  \mathbb P^{a,N}_{y_{1,t}^{1:i}}(X_1 \in dx_1, F_1 = f_1)
  \triangleq C_1 V(f_1) \prod_{l=1}^{i} \frac{1}{N}\, \delta_{y_1^l}(dx_1^{f_1^l}) \prod_{n \notin f_1} M_1(dx_1^n), \tag{24}
\]

for $s = 2, \dots, t$,

\[
  \mathbb P^{a,N}_{y_{1,t}^{1:i}}(X_s \in dx_s, A_{s-1} = a_{s-1}, F_s = f_s \mid
    X_{1,s-1} = x_{1,s-1}, A_{1,s-2} = a_{1,s-2}, F_{s-1} = f_{s-1})
\]
\[
  \triangleq C_s V(f_s) \prod_{l=1}^{i} \alpha_{s-1}^{f_s^l f_{s-1}^l}\, \delta_{y_s^l}(dx_s^{f_s^l})\, \mathbb 1(a_{s-1}^{f_s^l} = f_{s-1}^l)
  \times \prod_{n \notin f_s} r_n(a_{s-1}^n \mid w_{s-1}, x_{1,s-1})\, M_s(x_{s-1}^{a_{s-1}^n}, dx_s^n), \tag{25}
\]

and

\[
  \mathbb P^{a,N}_{y_{1,t}^{1:i}}(A_t = a_t \mid X_{1,t} = x_{1,t}, A_{1,t-1} = a_{1,t-1})
  \triangleq \frac{w_t^{a_t} g_t(x_t^{a_t})}{\sum_{n=1}^N w_t^n g_t(x_t^n)}, \tag{26}
\]

where the $C_s$ are normalization constants that ensure the expressions are valid
probabilities. We have used the notation

\[
  w_1^n \triangleq 1, \qquad
  w_t^n \triangleq \sum_{k=1}^N \alpha_{t-1}^{nk} w_{t-1}^k g_{t-1}(x_{t-1}^k), \tag{27}
\]

and

\[
  r_n(k \mid w_{s-1}, x_{1,s-1}) \triangleq \frac{\alpha_{s-1}^{nk} w_{s-1}^k g_{s-1}(x_{s-1}^k)}{w_s^n}. \tag{28}
\]

If $i = 1$, we write caSMC instead of c$^1$aSMC.

2.5. Particle Gibbs and Iterated Conditional SMC. In the language of
state-space models, the algorithms so far have been designed to approximate the posterior
distribution of a Markov chain given indirect stochastic observations of the chain’s
values. However, it is often the case that the chain and the potentials are controlled
by a global parameter $\theta \in \Theta$ for which there is a prior distribution $\varpi$. We
replace $M_s$ by $M_s^\theta$ and $g_s$ by $g_s^\theta$, then parameterize the other quantities
defined previously in terms of $M_s$ and $g_s$ by the additional parameter $\theta$.
Throughout this section we fix $t$ and let $(Y, \mathcal Y) = (E^t, \mathcal B(E^t))$. Since $t$ is
fixed we will suppress much of the time notation when possible in order to make the
notation less cluttered. The target distribution on the product space
$(\Theta \times Y, \mathcal B(\Theta \times Y))$ is

\[
  \pi(d\theta \times dy) \triangleq \gamma(d\theta \times dy)/Z \tag{29}
\]

where

\[
  \gamma(d\theta \times dy) \triangleq \prod_{s=1}^{t} g_s^\theta(y_s)\, M_s^\theta(y_{s-1}, dy_s)\, \varpi(d\theta)
  \quad\text{and}\quad Z \triangleq \gamma(1). \tag{30}
\]

We denote the conditional distributions given $\theta \in \Theta$ or $y \in Y$ by, respectively,
$\pi_\theta(dy)$ or $\pi_y(d\theta)$.

Particle Markov chain Monte Carlo (PMCMC) methods use SMC as a component
within an MCMC algorithm to obtain approximate samples from $\pi(d\theta \times dy)$
[1]. We focus on the particle Gibbs (PG) sampler, which approximates the two-stage
Gibbs kernel

\[
  \Pi(\theta, y, d\vartheta \times dz) \triangleq \pi_y(d\vartheta)\, \pi_\vartheta(dz). \tag{31}
\]

In many settings, such as non-linear or non-Gaussian state-space models, it is
possible to sample from $\pi_y(d\vartheta)$, but difficult to sample from $\pi_\vartheta(dz)$. The
idea is to replace $\pi_\vartheta(dz)$ with an SMC-based approximation $\Pi_\vartheta^N(y, dz)$ that
leaves $\pi_\vartheta(dz)$ invariant, leading to a kernel of the form
$\pi_y(d\vartheta)\, \Pi_\vartheta^N(y, dz)$.

The standard PG sampler employs the iterated conditional SMC (i-cSMC) kernel
[2] $P_\vartheta^{R,N}$ to approximate the conditional distribution: $\Pi_\vartheta^N = P_\vartheta^{R,N}$.
For $y \in Y$, $\vartheta \in \Theta$, and with $\mathbf 1 \in \{1\}^t$, the i-cSMC kernel is given by

\[
  P_\vartheta^{R,N}(y, dz) \triangleq \mathbb E^{R,N}_{y, \mathbf 1, \vartheta}\big[\pi_{1,t}^{R,N}(dz)\big]. \tag{32}
\]

The invariant distribution of $P_\vartheta^{R,N}$ is $\pi_\vartheta$. The family of Markov chains
with transition kernels of the form $P_\vartheta^{R,N}(y, dz)$ are called i-cSMC processes.

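We now introduce the particle Gibbs sampler with adaptive resampling (the aPG
sampler), which employs what we will call the iterated conditional aSMC (i-caSMC)
kernel $P_\vartheta^{a,N}$ to approximate the conditional distribution:
$\Pi_\vartheta^N = P_\vartheta^{a,N}$. For $y \in Y$, $\vartheta \in \Theta$, the i-caSMC kernel, which
has invariant distribution $\pi_\vartheta$, is given by

\[
  P_\vartheta^{a,N}(y, dz) \triangleq \mathbb E^{a,N}_{y, \vartheta}\big[\pi_{1,t}^{a,N}(dz)\big]. \tag{33}
\]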
Figure 1. An example in which $E = \mathbb R$ and $\pi$ and $M$ are densities
with respect to Lebesgue measure. (a) Plots of the densities $\pi$ and
$M$, and the potential function $g \propto \pi/M$. (b) An example of an
IS estimate $\pi^{I,N}$ with $N = 20$ particles. The heights of the lines
indicate the weights of the particles sampled from $M$.

The family of Markov chains with transition kernels of the form $P_\vartheta^{a,N}(y, dz)$ will
be called i-caSMC processes.

3. Summary of Results 

We now provide a summary of our results concerning SMC for sampling directly 
and as a component of an MCMC algorithm. 

3.1. SMC for Sampling. The primary focus of our study will be on the expected
value of the SMC estimators defined in Section 2.3.

Definition 3.1. If $\pi^N$ is an SMC estimator of $\pi$, then the expected estimator is
$\bar\pi^N \triangleq \mathbb E^N[\pi^N]$, where for a measurable set $A$, $\bar\pi^N(A) = \mathbb E^N[\pi^N(A)]$. $\triangleleft$

To connect the expected estimator to the goal of sampling we note that the
marginal distribution of a sample from $\pi^N$ is $\bar\pi^N$. To the best of our knowledge,
there has been minimal investigation of expected SMC estimators, with [9, Chapter
8] a notable exception. For example, the bound

KL^IK) < (34) 

can be extracted as a special case of a more general propagation-of-chaos result [9,
Theorem 8.3.2]. Our interest will be in the other direction of the KL divergence,
$\mathrm{KL}(\pi \| \bar\pi^N)$. In a certain sense, the direction of the KL divergence we
investigate is the more “natural,” since $\mathrm{KL}(\mu \| \nu)$ is the expected number of
additional bits required to encode samples from $\mu$ when using a code for $\nu$
instead [7]. In other words, it is the amount of information lost by using samples
from $\nu$ instead of samples from $\mu$. Beyond this heuristic justification, however,
we shall see that the quantities that arise when studying $\mathrm{KL}(\pi \| \bar\pi^N)$ are
intimately related to those appearing in the study of particle Gibbs and the related
iterated conditional SMC algorithm.

3.1.1. Special Case: Importance Sampling. Before detailing our results in full
generality, to provide intuition as to how well $\bar\pi^N$ approximates $\pi$ as $N$
increases, we briefly consider the special case of importance sampling, which is
equivalent to either SIS or SIR when $t = 1$. We write $M = M_1$, $g = g_1$, $\pi = \pi_1$,
$Z = Z_1$, $X = X_1$, $\pi^{I,N} = \pi_1^{S,N}$, and $\bar\pi^{I,N} = \bar\pi_1^{S,N}$. Fig. 1a gives an
example of $\pi$, $M$, and $g$ in the case that $E = \mathbb R$ and $\pi$ and $M$ are densities
with respect to Lebesgue measure. Fig. 1b shows an example of an importance
sampling estimate with $N = 20$ particles. Fig. 2 shows the density of $\bar\pi^{I,N}$,
along with $\pi$ and $M$, for $N = 4, 8, 16, 32$ particles. Informally, for a “small”
number of particles, $\bar\pi^{I,N}$ is strongly distorted toward the proposal distribution
$M$: for $N$ not too large, with non-trivial probability all the samples from $M$ will
be in a region of low $\pi$-probability. Hence, all the potentials (normalized by a
factor of $1/Z$) will be small ($\ll 1$). But in order to form the probability measure
$\pi^{I,N}$ the sum of the weights is normalized to 1, creating an overweighting in
regions of high $M$-probability. However, as $N$ increases, the probability of
producing a sample from $M$ in a region of high $\pi$-probability (and thus with a
large potential) increases, which induces a better approximation to $\pi$.

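Using KL divergence to quantify the distortion in $\bar\pi^{I,N}$, we have the following:

Theorem 3.2.

\[
  \mathrm{KL}(\pi \| \bar\pi^{I,N}) \le \log\left(1 + \frac{\mathbb V[Z^{-1} g(X)]}{N}\right) \le \frac{\mathbb V[Z^{-1} g(X)]}{N}. \tag{35}
\]

The variance $\mathbb V[Z^{-1} g(X)]$ is, in fact, the $\chi^2$-divergence from $\pi$ to $M$ since

\[
  \mathbb V[Z^{-1} g(X)] = \int (Z^{-1} g(x) - 1)^2\, M(dx) \tag{36}
\]
\[
  = \int \left(\frac{d\pi}{dM}(x) - 1\right)^2 M(dx) \tag{37}
\]
\[
  = d_{\chi^2}(\pi, M). \tag{38}
\]

A classical result for bounding the KL divergence in terms of the $\chi^2$ divergence can
be recovered by taking $N = 1$. We then have $\mathrm{KL}(\pi \| \bar\pi^{I,N}) = \mathrm{KL}(\pi \| M)$, and so
Theorem 3.2 implies that

\[
  \mathrm{KL}(\pi \| M) \le \log(1 + d_{\chi^2}(\pi, M)). \tag{39}
\]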
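On a small finite state space the expected importance sampling estimator can be computed exactly by enumerating every possible set of $N$ particles, which gives a direct numerical check of the bound in Theorem 3.2. The sketch below is illustrative only: the three-point proposal `m`, the potential `g`, and the function name are assumptions chosen for the example, and the enumeration is feasible only for very small $N$.

```python
import itertools
import numpy as np

def expected_is_estimator(m, g, n_particles):
    """Exact expected self-normalized IS estimator on a finite space, obtained
    by enumerating every N-tuple of particles drawn i.i.d. from the proposal m.
    Assumes g > 0 so that the weights are always well defined."""
    bar_pi = np.zeros(len(m))
    for xs in itertools.product(range(len(m)), repeat=n_particles):
        idx = np.array(xs)
        prob = np.prod(m[idx])              # probability of this particle set
        w = g[idx] / g[idx].sum()           # normalized importance weights
        np.add.at(bar_pi, idx, prob * w)    # accumulate the weighted point masses
    return bar_pi

m = np.array([0.7, 0.2, 0.1])   # proposal M (illustrative)
g = np.array([0.2, 1.0, 3.0])   # potential g (illustrative)
z = float(m @ g)                # normalization constant Z
pi = m * g / z                  # target distribution
chi2 = float(np.sum(m * (g / z - 1.0) ** 2))

for n in (1, 2, 4):
    bar_pi = expected_is_estimator(m, g, n)
    kl = float(np.sum(pi * np.log(pi / bar_pi)))
    print(n, kl, np.log1p(chi2 / n), chi2 / n)
```

For $N = 1$ the expected estimator coincides with the proposal, and for each $N$ the three printed quantities should be nondecreasing from left to right, in line with Eq. (35).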


3.1.2. Basic SMC Results. Many of our results concerning $\bar\pi^N$ can be seen as
analogous to existing operator-perspective results, but from the measure perspective.
To make the analogies transparent, we first give an operator-perspective result,
followed by the comparable measure-perspective result.

A typical $L^p$ bound for SIR states that, for any $p \ge 1$ and any $\phi \in \mathcal B_b(E)$ [9],

\[
  \mathbb E^{R,N}\big[|\eta_t^{R,N}(\phi) - \eta_t(\phi)|^p\big]^{1/p} \le \frac{a(p)\, b(\phi)\, c_t}{\sqrt N}, \tag{40}
\]

where $c_t$ is a constant that depends only on $\{M_s, g_s\}_{s \in [t]}$. We will show that for
SIS

\[
  \mathrm{KL}(\pi_{1,t} \| \bar\pi_{1,t}^{S,N}) \le \frac{\mathcal S_t}{N} \tag{41}
\]

and for SIR

\[
  \mathrm{KL}(\pi_{1,t} \| \bar\pi_{1,t}^{R,N}) \le \frac{\mathcal R_t}{N} + \Theta(N^{-2}), \tag{42}
\]

where $\mathcal S_t$ and $\mathcal R_t$ are constants depending only on $\{M_s, g_s\}_{s \in [t]}$. All these
bounds hold under very mild assumptions.
Figure 2. The expected IS distribution $\bar\pi^{I,N}$ for $N = 4, 8, 16, 32$
particles. For small numbers of particles $\bar\pi^{I,N}$ is strongly biased
toward the proposal distribution $M$. The density of $\bar\pi^{I,N}$ in each
plot was approximated using a kernel density estimate.


3.1.3. Adaptive SMC Results. Recall (cf. Section 2.3.3) that the effective sample
size (ESS) of the aSMC particle weights at time $t$ is

\[
  \mathcal E_t^2 \triangleq \frac{\left(\sum_{n=1}^N W_t^n\right)^2}{\sum_{n=1}^N (W_t^n)^2} \tag{43}
\]

and the so-called adaptive particle filter resamples particles when
$\mathcal E_t^2 < \zeta N$ for some fixed $\zeta \in (0, 1]$. There are myriad heuristic arguments
for using the ESS criterion [20-22] as well as some theoretical analyses of the
behavior of adaptive resampling algorithms under a variety of technical
assumptions [8, 10, 23]. Whiteley, Lee, and Heine [23] provided a rigorous justification
for the use of ESS from the operator viewpoint. Roughly speaking, they showed that
if the ESS is not allowed to fall below $\zeta N$, for a fixed parameter $\zeta \in (0, 1]$,
then the SMC algorithm does in fact behave as if there are $\zeta N$ particles. More
formally, for $\phi \in \mathcal B_b(E)$ and $p \ge 1$, under appropriate regularity conditions,

\[
  \inf_{t \ge 1} \mathcal E_t^2 \ge \zeta N
  \implies
  \sup_{t \ge 1} \mathbb E^{a,N}\big[|\eta_t^{a,N}(\phi) - \eta_t(\phi)|^p\big]^{1/p} \le \frac{a(p)\, b(\phi)\, c_0}{\sqrt{\zeta N}}. \tag{44}
\]

Comparing Eq. (44) to [9, Theorem 7.4.4], which states that

\[
  \sup_{t \ge 1} \mathbb E^{R,N}\big[|\eta_t^{R,N}(\phi) - \eta_t(\phi)|^p\big]^{1/p} \le \frac{a(p)\, b(\phi)\, c_0}{\sqrt N}, \tag{45}
\]

we see that the condition $\inf_{t \ge 1} \mathcal E_t^2 \ge \zeta N$ ensures that the effective number
of particles in the time-uniform $L^p$ error bound is $\zeta N$, compared to $N$ particles
if SIR is used. So in this technical sense ESS is a measure of the effective sample
size.

We show that a different notion of ESS leads to similar results for aSMC from the
measure perspective, though we focus on fixed-time bounds, whereas the bounds
of [23] are time-uniform. We define the class of $p$-ESS measures, where
$p \in [1, \infty]$ is the parameter of the ESS measure. For $p \in (1, \infty]$, the $p$-ESS is
defined to be

\[
  \mathcal E^p(w) \triangleq \left(\frac{\|w\|_1}{\|w\|_p}\right)^{p/(p-1)}, \tag{46}
\]

where if $p = \infty$, then $p/(p-1) = 1$. The standard ESS given by Eq. (43) corresponds
to the 2-ESS. We prove that, under standard regularity conditions,


sup £ s °° >(N => KL(7r M ||7f“f) < ^ + 0(iV 2 ), 

se[t] 


(47) 


where $\mathcal R'_t$ is a constant depending only on $\{M_s, g_s\}_{s \in [t]}$.
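To compare the different ESS measures numerically, the sketch below assumes the $p$-ESS takes the form $(\|w\|_1 / \|w\|_p)^{p/(p-1)}$, with exponent 1 when $p = \infty$; this form is an assumption chosen to be consistent with the statement that the 2-ESS is the standard ESS, and the function name is hypothetical.

```python
import numpy as np

def p_ess(w, p):
    """p-ESS of nonnegative weights for p in (1, inf], under the assumed form
    (||w||_1 / ||w||_p)^(p / (p - 1)); for p = inf this is sum(w) / max(w)."""
    w = np.asarray(w, dtype=float)
    if np.isinf(p):
        return w.sum() / w.max()
    return (w.sum() / np.linalg.norm(w, ord=p)) ** (p / (p - 1.0))

w = np.array([1.0, 1.0, 2.0])
print(p_ess(w, 2), p_ess(w, np.inf))
```

Under this form the $\infty$-ESS never exceeds the 2-ESS, because $\sum_n w_n^2 \le \max_n w_n \sum_n w_n$, so a lower bound on the $\infty$-ESS is a stronger requirement than the same lower bound on the standard ESS.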


3.2. Sampling with PG and i-cSMC. In [2], conditions are given under which
the i-cSMC process is uniformly ergodic and the PG sampler is geometrically
ergodic. Specifically, under regularity conditions, there exists $\rho_{t,N} = O(1/N)$ such
that for all $y \in Y$ and for $P = P_\vartheta^{R,N}$,

\[
  d_{TV}(\delta_y P^k, \pi_\vartheta) \le \rho_{t,N}^k \tag{48}
\]

and the PG sampler is geometrically ergodic as soon as the Gibbs sampler is
geometrically ergodic. Furthermore, under a mixing-type condition, for any $C > 0$, if
the number of particles is chosen to be $N = Ct$, then there exists $0 < \rho_C < 1$ such
that $\sup_{t \ge 1} \rho_{t,N} \le \rho_C$.

We give similar results for the i-caSMC process and the aPG sampler. We show
that under appropriate regularity conditions, there exists $\rho_{t,N} = O(1/N)$ such that
for all $y \in Y$ and for $P = P_\vartheta^{a,N}$,

\[
  \inf_{s \in [t]} \mathcal E_s^\infty \ge \zeta N
  \implies
  \begin{cases}
    d_{TV}(\delta_y P^k, \pi_\vartheta) \le \rho_{t,N}^k \text{ and} \\
    \text{the aPG sampler is geometrically ergodic as soon as the Gibbs sampler is.}
  \end{cases} \tag{49}
\]

Furthermore, under a mixing-type condition and a regularity condition on the
matrices $\alpha \in \mathbb A_N$, for any $C > 0$, if $N = Ct$, then there exists $0 < \rho_C < 1$ such
that $\sup_{t \ge 1} \rho_{t,N} \le \rho_C$. As in sampling with aSMC, maintaining a lower bound
on the $\infty$-ESS is an important ingredient in the proof guaranteeing the convergence
of the i-caSMC process and the aPG sampler.


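4. Basic Results

In this section we give convergence rates of KL divergences for $\bar\pi_{1,t}^{S,N}$ and
$\bar\pi_{1,t}^{R,N}$.

4.1. Convergence Rates. The key quantities in this section are

\[
  \mathcal S_t \triangleq \mathbb V[Z_t^{-1} g_{1,t}(X_{1,t})], \qquad
  \mathcal R_t \triangleq Z_t^{-1} \sum_{s=1}^{t} G_{0,s}\, \pi_{1,t}(G_{s,t+1}) - t, \qquad
  \xi_t \triangleq Z_t^{-1}\, \pi_{1,t}\big(\mathbb E^{R,N}_{\cdot, \mathbf 1}[\hat Z_t]\big). \tag{50}
\]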
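Recall that for $x_s \in E$, $G_{s,t}(x_s) = \mathbb E[g_{s,t-1}(X_{s,t-1}) \mid X_s = x_s]$. For
$1 \le \ell \le s \le t$, let

\[
  T_{\ell,s} \triangleq \{(\tau_1, \dots, \tau_\ell) : t - s + 1 < \tau_1 < \cdots < \tau_\ell = t + 1\} \tag{51}
\]

and, for $\tau \in T_{\ell,s}$, define

\[
  C_\ell(\tau, y_{1,t}) \triangleq \prod_{i=1}^{\ell-1} G_{\tau_i, \tau_{i+1}}(y_{\tau_i}). \tag{52}
\]

We will sometimes write $C_\ell(\tau)$ or $C_\ell(y_{1,t})$ instead of $C_\ell(\tau, y_{1,t})$. We will
show (Proposition 4.8) that

\[
  \xi_t = Z_t^{-1} \sum_{\ell=1}^{t+1} \frac{(1 - 1/N)^{t+1-\ell}}{N^{\ell-1}} \sum_{\tau \in T_{\ell,t+1}} G_{0,\tau_1}\, \pi_{1,t}(C_\ell). \tag{53}
\]

Hence, $\xi_t$ is related to $\mathcal R_t$:

\[
  \xi_t = \left(1 - \frac{1}{N}\right)^{t} + \frac{(1 - 1/N)^{t-1}}{N}\, Z_t^{-1} \sum_{s=1}^{t} G_{0,s}\, \pi_{1,t}(G_{s,t+1}) + \Theta(N^{-2}) \tag{54}
\]
\[
  = 1 + \frac{\mathcal R_t}{N} + \Theta(N^{-2}).
\]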

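Theorem 4.1. For SIS,

\[
  \mathrm{KL}(\pi_{1,t} \| \bar\pi_{1,t}^{S,N}) \le \log\left(1 + \frac{\mathcal S_t}{N}\right) \le \frac{\mathcal S_t}{N} \tag{55}
\]

and, for SIR,

\[
  \mathrm{KL}(\pi_{1,t} \| \bar\pi_{1,t}^{R,N}) \le \log \xi_t = \log\left(1 + \frac{\mathcal R_t}{N} + \Theta(N^{-2})\right) \tag{56}
\]
\[
  \le \frac{\mathcal R_t}{N} + \Theta(N^{-2}). \tag{57}
\]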

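Pinsker’s inequality can be used to bound the total variation distance, though the
SIR convergence rate is not optimal since, as [9] shows,
$d_{TV}(\pi_{1,t}, \bar\pi_{1,t}^{R,N}) = O(1/N)$:

Corollary 4.2.

\[
  d_{TV}(\pi_{1,t}, \bar\pi_{1,t}^{S,N}) \le \sqrt{\frac{1}{2} \log\left(1 + \frac{\mathcal S_t}{N}\right)} \le \sqrt{\frac{\mathcal S_t}{2N}}
\]

and

\[
  d_{TV}(\pi_{1,t}, \bar\pi_{1,t}^{R,N}) \le \sqrt{\frac{1}{2} \log\left(1 + \frac{\mathcal R_t}{N} + \Theta(N^{-2})\right)} \le \sqrt{\frac{\mathcal R_t}{2N}} + \Theta(N^{-1}).
\]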


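The following technical lemma will repeatedly prove useful:

Lemma 4.3. Let $X$ and $Y$ be random elements in Borel spaces $(S, \mathcal S)$ and
$(T, \mathcal T)$, respectively, let $\psi : S \times T \to \mathbb R_+$ be measurable, and let $\mu$ be the
distribution of $X$. If

\[
  \nu \triangleq \mathbb E[\psi(X, Y)\, \delta_X], \tag{58}
\]

then $\nu \ll \mu$ and

\[
  \frac{d\nu}{d\mu}(X) = \mathbb E[\psi(X, Y) \mid X] \quad \text{a.s.} \tag{59}
\]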
( 59 ) 
Proof. Because $S$ is Borel, there exists an $f$ satisfying $f(X) = \mathbb E[\psi(X, Y) \mid X]$ a.s.
It follows from the chain rule of conditional expectation and then some elementary
manipulations that, for all $A \in \mathcal S$,

\[
  \nu(A) = \mathbb E[f(X)\, \delta_X(A)] = \mathbb E[f(X)\, \mathbb 1_A(X)] = \int_A f(x)\, \mu(dx),
\]

and so $f$ is a version of the Radon-Nikodym derivative $d\nu/d\mu$. □


Proposition 4.4. For the SIS algorithm, $\bar\pi_{1,t}^{S,N} \ll \pi_{1,t}$ and

\[
  \frac{d\bar\pi_{1,t}^{S,N}}{d\pi_{1,t}}(y_{1,t}) = \mathbb E^{S,N}\left[\frac{Z_t}{\hat Z_t} \,\middle|\, X_{1,t}^1 = y_{1,t}\right]. \tag{60}
\]

Although Proposition 4.4 (and Proposition 4.6 below) follow as special cases of
Proposition 5.1, the proofs of these results contain key ideas in simplified form, so
we have included them.


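Proof. We have

\[
  \bar\pi_{1,t}^{S,N} = \mathbb E^{S,N}\left[\sum_{n=1}^N \frac{g_{1,t}(X_{1,t}^n)}{\sum_{k=1}^N g_{1,t}(X_{1,t}^k)}\, \delta_{X_{1,t}^n}\right] \tag{61}
\]
\[
  = \mathbb E^{S,N}\left[\frac{g_{1,t}(X_{1,t}^1)}{\frac{1}{N} \sum_{k=1}^N g_{1,t}(X_{1,t}^k)}\, \delta_{X_{1,t}^1}\right] \tag{62}
\]

and so, by Lemma 4.3 and the definition of $\hat Z_t$,

\[
  \frac{d\bar\pi_{1,t}^{S,N}}{dM_{1,t}}(y_{1,t}) = g_{1,t}(y_{1,t}) \cdot \mathbb E^{S,N}\left[\frac{1}{\hat Z_t} \,\middle|\, X_{1,t}^1 = y_{1,t}\right], \tag{63}
\]

where $M_{1,t} = M_1 M_2 \cdots M_t$. Upon noting that
$\frac{d\pi_{1,t}}{dM_{1,t}}(y_{1,t}) = g_{1,t}(y_{1,t})/Z_t$, the result follows. □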


Lemma 4.5. $\bar\pi^{S,N}_{1,t}$ and $\pi_{1,t}$ are absolutely continuous with respect to each other.

Proof. By Proposition 4.4 it suffices to show that $\pi_{1,t} \ll \bar\pi^{S,N}_{1,t}$. Note that for all
$A \in \mathcal{B}(E^t)$ with $\bar\pi^{S,N}_{1,t}(A) = 0$ there exists $B \subseteq A$ such that $M_{1,t}(B) = 0$ and,
for all $y_{1,t} \in A \setminus B$, $g_{1,t}(y_{1,t}) = 0$. But since $\pi_{1,t} \ll M_{1,t}$, $M_{1,t}(B) = 0 \Rightarrow \pi_{1,t}(B) = 0$,
and since $g_{1,t}(y_{1,t}) = 0$ for $y_{1,t} \in A \setminus B$, $\pi_{1,t}(A \setminus B) = 0$ as well. So $\pi_{1,t}(A) = 0$. □

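Proof of Theorem 4.1 (SIS portion). By Proposition 4.4 and Jensen’s inequality,

$$\frac{d\bar\pi^{S,N}_{1,t}}{d\pi_{1,t}}(y_{1,t}) = E^{S,N}\left[\frac{Z_t}{N^{-1}\sum_{n=1}^N g_{1,t}(X^n_{1,t})} \,\middle|\, X^1_{1,t} = y_{1,t}\right] \tag{64}$$

$$\ge \frac{Z_t}{E^{S,N}\left[N^{-1}\sum_{n=1}^N g_{1,t}(X^n_{1,t}) \,\middle|\, X^1_{1,t} = y_{1,t}\right]} \tag{65}$$

$$= \frac{N}{N - 1 + Z_t^{-1} g_{1,t}(y_{1,t})}. \tag{66}$$

Lemma 4.5 implies that $\frac{d\pi_{1,t}}{d\bar\pi^{S,N}_{1,t}} = \left(\frac{d\bar\pi^{S,N}_{1,t}}{d\pi_{1,t}}\right)^{-1}$, which together with Jensen’s inequality
yields

$$KL(\pi_{1,t} \,\|\, \bar\pi^{S,N}_{1,t}) = \pi_{1,t}\left(\log \frac{d\pi_{1,t}}{d\bar\pi^{S,N}_{1,t}}\right) \tag{67}$$

$$\le \log \pi_{1,t}\left(\frac{d\pi_{1,t}}{d\bar\pi^{S,N}_{1,t}}\right) \tag{68}$$

$$\le \log \pi_{1,t}\left(\frac{N - 1 + Z_t^{-1} g_{1,t}}{N}\right) \tag{69}$$

$$= \log\left(1 + \frac{M_{1,t}\big((Z_t^{-1} g_{1,t})^2\big) - 1}{N}\right) \tag{70}$$

$$= \log\left(1 + \frac{V[Z_t^{-1} g_{1,t}(X_{1,t})]}{N}\right). \tag{71}$$

□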
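The SIS bound can be checked numerically in a toy setting. The sketch below assumes a single time step on a three-point state space, with a made-up proposal `m` and strictly positive potential `g`. It computes the expected self-normalized importance sampling measure exactly by enumerating all particle configurations and compares the KL divergence with $\log(1 + V/N)$.

```python
import itertools
import math

def sis_expected_law(m, g, n_particles):
    """Exact law of one draw from the self-normalized importance sampling
    approximation: bar_pi(y) = E[sum_n W^n 1{X^n = y}], with X^n i.i.d. from m."""
    bar_pi = [0.0] * len(m)
    for xs in itertools.product(range(len(m)), repeat=n_particles):
        prob = math.prod(m[x] for x in xs)
        total = sum(g[x] for x in xs)
        for x in xs:
            bar_pi[x] += prob * g[x] / total
    return bar_pi

m = [0.5, 0.3, 0.2]   # hypothetical proposal distribution
g = [1.0, 2.0, 4.0]   # hypothetical strictly positive potential
z = sum(mi * gi for mi, gi in zip(m, g))                     # normalizing constant
pi = [mi * gi / z for mi, gi in zip(m, g)]                   # target distribution
var = sum(mi * (gi / z) ** 2 for mi, gi in zip(m, g)) - 1.0  # V[g(X)/Z] under m

results = []
for n in (1, 2, 4):
    bar_pi = sis_expected_law(m, g, n)
    kl = sum(p * math.log(p / q) for p, q in zip(pi, bar_pi))
    results.append((n, kl, math.log1p(var / n)))
    print(n, kl, math.log1p(var / n))
```

The enumeration costs $K^N$ terms for $K$ states, so it is only meant for very small examples.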


To prove the SIR portion of Theorem 4.1, a few additional definitions are needed.
We can write the joint density of SIR as

$$\psi(x_{1,t}, a_{1,t-1}) = \left(\prod_{n=1}^N M_1(x_1^n)\right) \prod_{s=2}^t \left( r(a_{s-1} \mid w_{s-1}) \prod_{n=1}^N M_s\big(x_{s-1}^{a_{s-1}^n}, x_s^n\big) \right), \tag{72}$$

where

$$w_s^n \triangleq w_s^n(x_s) \triangleq \frac{g_s(x_s^n)}{\sum_{k=1}^N g_s(x_s^k)} \tag{73}$$

and the carrier measure is implicit. For trajectory $y_{1,t} \in E^t$ and lineage $k_{1,t} \in [N]^t$,
we can write the density of the cSMC process with law $P^{R,N}_{y_{1,t},k_{1,t}}(X_{1,t}, A_{1,t-1})$ as


N 


= II 1 ^ = = x * a ) ( n 

\n^k{ 


N 


(74) 


n ( r ( a s-i\w s ~i) n ) • 

n^kj 


Let $b^k_t = k$ and for $s = t-1, \dots, 1$, $b^k_s = a_s^{b^k_{s+1}}$, so $x^k_{1,t} = (x_1^{b^k_1}, x_2^{b^k_2}, \dots, x_t^{b^k_t})$.
Furthermore, if $k = k_t$, then $k_s = b^k_s$ and, by implicitly identifying $x^k_{1,t}$ with $y_{1,t}$
and $b^k_{1,t}$ with $k_{1,t}$, we can rewrite the cSMC density in the “collapsed” form


N 


N 


4>(k, xi tt , ai it _i) = ( Mi(xi) ] ( r(a s _i | tu s _i) M s {x a s a _^,x n s ) 




8=2 






( 75 ) 
For notational convenience, we will write $E^{R,N}_{y_{1,t}}[\cdot] = E^{R,N}_{y_{1,t},\mathbf{1}}[\cdot]$ for expectations with
respect to the law of the cSMC process with the trajectory $y_{1,t}$ and the fixed
lineage $k_{1,t} = \mathbf{1}$, the vector of all ones.


Proposition 4.6. 

,-R,iV 

d7r l,t 

d7Tl it 

Proof. Consider the density 


(2/m) = KH 


\z t l 

Z t 

1 

[z t \ 

“ E R)iV 

i,t 



Then (similarly to 1, Theorem 2]) 


7r i,i( a: M)V’( fc > *i,t, 


W 


_ WfMi(x , 1 1 ) IIs=2 r multi(&s_l | W s -l)M s (x° s s _f , ) 


x\ t , ai^-i) 


N-^iAxit) 


Mijxi 1 ) nU m s[x s ‘_ i 1 , xs ’) nUi w s a 
N-h 

Mi(^) nU M.Cofc, off) nUr g«(s&) 

Zt 


K \ TT< 


Zt( x i,ti a i,t- 1) 


where 


a i,t-i) — E R-Ar [Zt | Xi, t — 

We thus have 
tti ’t ( d 2/i,t) 

= E / ' t l’( x i,t,ai,t-i)wt6 x k t (dy 1> t)dx 1 ^ t 


01,1-1 ,k ' 


ip{xi t , ait~i)Wf vj ,, v, 

-—77 - x ± tl a\ t -i)o x k (dyi t)dx\ t 

7ri,f(fc)*i,f,ai I t_i)J a .fc t (dj/i it ) > 



(76) 

(77) 

(78) 

(79) 

(80) 
(81) 

(82) 

(83) 

(84) 

(85) 


The result follows from Lemma 4.3.


V^l; *i,t! a i,t-i)^ lit (da;^ t )J7ri )t (d7/ 1)t ). (86) 

□ 


Definition 4.7. A collection of lineages $k^1_{1,t}, \dots, k^i_{1,t} \in [N]^t$ is said to be distinct
when $k^j_s \ne k^{j'}_s$ for all $s \in [t]$ and all $j, j' \in [i]$ with $j \ne j'$. ◁

The next result is a straightforward generalization of [2, Proposition 9].
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Proposition 4.8. For all t > 1, i > 1, N > i, y\ t ,..., y\ t £ E*, and distinct 
lineages ki t , 


t +1 




i=i 


tGTm+i 


m =1 j =1 


In particular, for y\^ t £ E*, 


t +1 


K;!?&]=wY,( N - i y +1 ~* E G 0 , ri Q(r,y M ). 


( 88 ) 


t=i 


Lemma 4.9. $\pi_{1,t}$ and $\bar\pi^{R,N}_{1,t}$ are absolutely continuous with respect to each other.

Proof. The reasoning is analogous to the proof of Lemma 4.5. □

Proof of Theorem 4.1 (SIR portion). The SIR bound follows from Propositions 4.6
and 4.8, Lemma 4.9, and Jensen’s inequality. □


Remark 4.1. As one should expect, in the case of $g_s \equiv 1$ for $s \in [t]$,

$$KL(\pi_{1,t} \,\|\, \bar\pi^{S,N}_{1,t}) = KL(\pi_{1,t} \,\|\, \bar\pi^{R,N}_{1,t}) = 0,$$

and so SIS and SIR are equivalent from the measure perspective. Indeed, from the
measure perspective, all that is required is a single sample from the chain $(X_t)_{t \ge 1}$.
However, when $\pi^{S,N}_{1,t}$ and $\pi^{R,N}_{1,t}$ are used as estimators, $\pi^{S,N}_{1,t}$ is superior to $\pi^{R,N}_{1,t}$.
Specifically, for $\phi \in \mathcal{B}_b(E^t)$, SIS produces $N$ independent samples, so

$$V^{S,N}[\pi^{S,N}_{1,t}(\phi)] = \frac{V[\phi(X_{1,t})]}{N}.$$

For simplicity, consider a version of $\pi^{R,N}_{1,t}$ obtained by first generating $N$ samples
from $\pi_{1,t}$, then applying multinomial resampling to obtain the final $N$ samples. In
this case it is easy to show that

$$V^{R,N}[\pi^{R,N}_{1,t}(\phi)] = \frac{(2N - 1)\,V[\phi(X_{1,t})]}{N^2} \approx 2\,V^{S,N}[\pi^{S,N}_{1,t}(\phi)],$$

so SIS is superior to SIR from the operator perspective.
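The variance comparison in Remark 4.1 can be illustrated by simulation. The sketch below is a toy check under assumed choices (a standard normal target, $\phi(x) = x$, and $N = 10$), none of which come from the paper. It compares the empirical variance ratio with the factor $(2N-1)/N$.

```python
import random
import statistics

def sis_estimate(rng, n):
    """Mean of phi(x) = x over n i.i.d. standard normal draws."""
    return statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n)])

def sir_estimate(rng, n):
    """Same draws, followed by one round of multinomial resampling with equal weights."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return statistics.fmean(rng.choices(xs, k=n))

rng = random.Random(0)
n, reps = 10, 20000
var_sis = statistics.pvariance([sis_estimate(rng, n) for _ in range(reps)])
var_sir = statistics.pvariance([sir_estimate(rng, n) for _ in range(reps)])

ratio = var_sir / var_sis
predicted = (2 * n - 1) / n
print(ratio, predicted)
```

The two numbers should agree up to Monte Carlo error, which shrinks as `reps` grows.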


4.2. Quantitative Bounds. We can obtain explicit bounds in the SIR case using
results from Andrieu, Lee, and Vihola [2] to derive bounds on the expected value of
the normalization constant estimate. To obtain bounds on the KL divergence from a bound
on $E^{R,N}_{y_{1,t}}[\hat Z_t]$ we use the following simple lemma, which follows immediately from
the definition of $\varsigma_t$ and Theorem 4.1.

Lemma 4.10. If $Z_t^{-1} E^{R,N}_{y_{1,t}}[\hat Z_t] \le B_N$ for some $B_N$ that does not depend on $y_{1,t}$,
then

$$KL(\pi_{1,t} \,\|\, \bar\pi^{R,N}_{1,t}) \le \log B_N.$$

In the case of bounded potential functions, a bound is straightforward to obtain.
However, it requires the number of particles $N$ to grow exponentially in $t$ in order
for the bound to remain constant. Let $\bar g_s = \sup_{x \in E} g_s(x)$.
Proposition 4.11. Assume that $\bar g_s < \infty$ for all $s \in [t]$. Then for all $t \ge 1$, $i \ge 1$,
$N \ge i$, $y^1_{1,t}, \dots, y^i_{1,t} \in E^t$, and distinct lineages $k^1_{1,t}, \dots, k^i_{1,t}$,

$$E^{R,N}_{y^{1:i}_{1,t},\,k^{1:i}_{1,t}}[\hat Z_t] \le Z_t\left\{1 + \left[1 - (1 - i/N)^t\right]\left(Z_t^{-1}\prod_{s=1}^t \bar g_s - 1\right)\right\}. \tag{89}$$

Proof. The proof is a straightforward generalization of that for [2, Proposition 12]
with some additional bookkeeping for $i$ (instead of 2) fixed trajectory lineages. □

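Corollary 4.12. Assume that $\bar g_s < \infty$ for all $s \in [t]$. Then

$$KL(\pi_{1,t} \,\|\, \bar\pi^{R,N}_{1,t}) \le \log\left\{1 + \left[1 - (1 - N^{-1})^t\right]\left(Z_t^{-1}\prod_{s=1}^t \bar g_s - 1\right)\right\}$$

$$\le \left[1 - (1 - N^{-1})^t\right]\left(Z_t^{-1}\prod_{s=1}^t \bar g_s - 1\right)$$

$$\le \frac{t\left(Z_t^{-1}\prod_{s=1}^t \bar g_s - 1\right)}{N}.$$

Proof. Combine Lemma 4.10 and Proposition 4.11. □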
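To see how the bounded-potential bound behaves, the following sketch evaluates the three successive upper bounds of Corollary 4.12. The argument `ratio` stands for $Z_t^{-1}\prod_{s} \bar g_s$, and the assumed growth `1.5 ** t` is a made-up illustration of the exponential dependence on $t$.

```python
import math

def kl_bounds_bounded_potentials(t, n_particles, ratio):
    """Return the three successive upper bounds of Corollary 4.12.

    `ratio` stands for Z_t^{-1} * prod_s sup_x g_s(x), which is always >= 1.
    """
    shrink = 1.0 - (1.0 - 1.0 / n_particles) ** t
    excess = ratio - 1.0
    return (
        math.log1p(shrink * excess),   # log{1 + [1 - (1 - 1/N)^t](ratio - 1)}
        shrink * excess,               # [1 - (1 - 1/N)^t](ratio - 1)
        t * excess / n_particles,      # t (ratio - 1) / N
    )

# Hypothetical example in which the ratio grows like 1.5**t.
for t in (5, 10, 20):
    print(t, kl_bounds_bounded_potentials(t, n_particles=1000, ratio=1.5 ** t))
```

With a fixed particle budget the bounds deteriorate quickly as `t` grows, which is the behavior the text warns about.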

To obtain better dependence on $t$, stronger assumptions on $\{M_t, g_t\}$ are required.

Assumption 4.A. There exists a constant $\beta > 0$ such that for any $t, s \in \mathbb{N}$,

$$\sup_{x \in E} \frac{G_{0,t}\,G_{t,t+s}(x)}{G_{0,t+s}} \le \beta.$$


Proposition 4.13. If Assumption 4.A holds, then for all $t \ge 1$, $i \ge 1$, $N \ge i$,
$y^1_{1,t}, \dots, y^i_{1,t} \in E^t$, and distinct lineages $k^1_{1,t}, \dots, k^i_{1,t}$,

$$E^{R,N}_{y^{1:i}_{1,t},\,k^{1:i}_{1,t}}[\hat Z_t] \le Z_t\left(1 + \frac{i(\beta - 1)}{N}\right)^t. \tag{90}$$

Proof. The proof is a simple generalization of that for [2, Proposition 14]. □

Corollary 4.14. If Assumption 4.A holds, then

$$KL(\pi_{1,t} \,\|\, \bar\pi^{R,N}_{1,t}) \le t \log\left(1 + \frac{\beta - 1}{N}\right) \le \frac{(\beta - 1)t}{N}.$$

Proof. Combine Lemma 4.10 and Proposition 4.13. □
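Corollary 4.14 suggests that, under Assumption 4.A, the number of particles only needs to grow linearly in $t$. The sketch below makes this concrete. The function names and the values `beta=3.0` and `eps=0.1` are illustrative assumptions, not quantities from the paper.

```python
import math

def particles_for_kl(t, beta, eps):
    """Smallest integer N with (beta - 1) * t / N <= eps, up to floating-point rounding.

    Uses the looser bound of Corollary 4.14; assumes beta > 1 and eps > 0.
    """
    return max(1, math.ceil((beta - 1.0) * t / eps))

def kl_bound_mixing(t, beta, n_particles):
    """Tighter bound t * log(1 + (beta - 1) / N) from Corollary 4.14."""
    return t * math.log1p((beta - 1.0) / n_particles)

for t in (10, 100, 1000):
    n = particles_for_kl(t, beta=3.0, eps=0.1)
    print(t, n, kl_bound_mixing(t, 3.0, n))
```

Because the particle count is chosen from the looser bound, the tighter logarithmic bound evaluated at that count stays below the tolerance.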


Assumption 4.A is implied by a standard “strong mixing” condition which is
often employed in SMC analyses (e.g., [9, 23]).

Assumption 4.B. There exists some $m \ge 1$ such that:

(a) for some $1 < \gamma < \infty$, for any $t \ge 1$ and $x, x' \in E$ and $A \in \mathcal{E}$,

$$M_{t,t+m}(x, A) \le \gamma\, M_{t,t+m}(x', A),$$

where $M_{t,t+m} = M_t M_{t+1} \cdots M_{t+m}$;

(b) for some $1 < \delta < \infty$, the potentials satisfy

$$\sup_{x,x' \in E} \frac{g_t(x)}{g_t(x')} \le \delta^{1/m}.$$

Lemma 4.15 ([2, Lemma 17]). If Assumption 4.B holds, then for all $t, s \ge 1$,

$$\sup_{x,x' \in E} \frac{G_{t,t+s}(x)}{G_{t,t+s}(x')} \le \gamma\delta,$$

so Assumption 4.A holds.

5. SMC with Adaptive Resampling 

We now turn to the αSMC algorithmic framework to investigate the convergence
properties of SMC with adaptive resampling. As summarized earlier, Whiteley, Lee,
and Heine [23] show that maintaining a lower bound of $\mathcal{E}^2_t \ge \zeta N$ guarantees time-
uniform convergence of αSMC in the $L_p$ sense at an $O(N^{-1/2})$ rate. We will prove
an analogous result from the measure perspective using a quantity $\mathcal{E}^\infty_t$ we call the
∞-ESS. We show that maintaining a lower bound of $\mathcal{E}^\infty_t \ge \zeta N$ suffices to guarantee
fixed-time convergence of αSMC in KL divergence at an $O(1/N)$ rate. Note that
our result is weaker than that of Whiteley, Lee, and Heine [23] in the sense that
theirs is time-uniform while ours is only for a fixed time. We will show that $\mathcal{E}^\infty_t$ is
a stricter notion of effective sample size than $\mathcal{E}^2_t$.

We can write the joint density of the aSMC sampler as 


(91) 


\n= 1 


Vs—2 n— 1 


where 


rn(k\ w t.-i, *i,t-i) A 




(92) 


We will work under the following assumption: 

Assumption 5.C. For all $N$, all $\alpha \in \mathbb{A}_N$ are doubly stochastic.

Remark 5.1. Assumption 5.C is the same as Assumption (B++) in [23], though
there it is stated as assuming each $\alpha$ admits the uniform distribution as an invariant
measure.

We begin by noting that, under Assumption 5.C, the density of the cαSMC process
with law $P^{\alpha,N}_{y_{1,t}}(X_{1,t}, A_{1,t-1}, F_{1,t})$ can be written in the following “collapsed”
form, by implicitly identifying $x^{f_{1,t}}_{1,t}$ with $y_{1,t}$ (cf. Eq. (75)):
a l,t-l, f l,t) 

ip a (x ht , oi, t i) n:= 2 W-r 1 


NMi (xf 1 ) n!=2 ff e (fa-l\w a - 1 ,Xi ia -l)M 3 (Xg‘_i,xi’) 


(93) 


=n M ^m \ isa s- 1 1 n 

njtfi s=2 V n^f s 


where $I_s = 1(a^{f_s}_{s-1} = f_{s-1})$. Let $\mathcal{F}_s$ be the σ-algebra generated by $X_{1,s}$, $A_{1,s-1}$,
and $F_{1,s}$, where by convention we let $\mathcal{F}_0$ be the trivial σ-algebra.
Proposition 5.1. If Assumption 5.C holds, then the αSMC estimator satisfies

$$\frac{d\bar\pi^{\alpha,N}_{1,t}}{d\pi_{1,t}}(y_{1,t}) = E^{\alpha,N}_{y_{1,t}}\left[\frac{Z_t}{\hat Z_t}\right].$$


Proof. Consider the density 

7r“t(*i ,t, a i,t-i, fi, t ) ~ ho 

Then 

, 4 >a ( x i,t, a i,t-i)gt{x{ t )w{ t 

/i,t) En=i 9t{xt) w t 

-^i (^l 1 ) n^=2 (/ s -ik 8 -i, a:i, s -i)M s (sifri , a:{») 


7 T 1 nl=2 liV_1 En=l 9 t(Xt) w t 

Ml ( 4 1 ) n‘ s =2 aflEXEXXEi 1 )M s (x f /s 1 1 , x{-)g t (x{<)w {‘ 


TTl.tOqt) n:=2 ^ nl=2 Is a s -1 1N ~ X T,n=l 9t(Xt) w t 

nUi s* x x 1 4°) 


En=i nl=2 ^ 


„/t 

i,t 


®l,t— l) J"I S= 2 -Ts 


where 


Zt{xi,t, Qi,t-i) — E“' Ar [2't | — Xij, Aij-i — oi,t—i]. 

Using the convention that 0/0 = 0, it follows that 

*if (dyi ,t) 

i’ 0 ‘(x 1 ,t,a lti - 1 )g t (x% t )w1‘ t ^ 


= E 

ai,t-i,a 

= E 


= E 

={ u 


En=l 5* XX 




7rf it (*i,t, / 1)t ) En=i St XX 

Z t 

Zt{x i } t, o. i,t — i) 


^“t( a; i,t) a i,t-i ) / 1 ,t)^/t t (dy 1 ,t) 


(94) 


(95) 


(96) 

(97) 

(98) 

(99) 

( 100 ) 


( 101 ) 


( 102 ) 


(103) 


Zffxi t, CL\ t— l) 
The result follows from Lemma 4.3.


^“(^a; M ,a 1>t _i)(^ lit (da;f( t )j>7r M (d?/ M ). (104) 

□ 


With Proposition 5.1 in hand, our next task is to understand the quantity
$E^{\alpha,N}_{y_{1,t}}[\hat Z_t]$. To do so, we will require a generalized notion of effective sample size,
which includes two commonly used definitions as special cases.

Definition 5.2. For parameter $p \in [1, \infty]$, let $p_* = p/(p-1)$ be the conjugate exponent
of $p$ (so $1/p + 1/p_* = 1$). The p-effective sample size (p-ESS) at time $t$ is

$$\mathcal{E}^p_t \triangleq \begin{cases} \left(\dfrac{\|W_t\|_1}{\|W_t\|_p}\right)^{p_*} & p \in (1, \infty], \\[2ex] \dfrac{\|W_t\|_1}{\prod_{n=1}^N (W^n_t)^{W^n_t/\|W_t\|_1}} & p = 1. \end{cases} \tag{105}$$

The following proposition catalogs some elementary properties of p-ESS (see
Appendix A.1 for a proof).

Proposition 5.3. The p-ESS has the following properties:

(1) The 1-ESS satisfies

$$\mathcal{E}^1_t = \lim_{p \downarrow 1} \mathcal{E}^p_t = \mathcal{E}^{ent}_t = e^{H(W_t)}, \tag{106}$$

where $H(W) \triangleq -\sum_{n=1}^N \frac{W^n}{\|W\|_1} \log \frac{W^n}{\|W\|_1}$.

(2) For all $p \in [1, \infty]$, $1 \le \mathcal{E}^p_t \le N$. The lower bound is achieved if and only if all
but one of the weights are zero. The upper bound is achieved if and only if all
the weights are equal.

(3) For 1 < p < q < 00 , Ef > £/ > E p , with equality if and only if £f 

and £/ equal 1 or N. 

Part (1) of Proposition 5.3 shows that the 1-ESS corresponds to the entropic ESS,
which is a common choice of ESS in applications [6]. Note that 2-ESS corresponds
to the standard definition of ESS. Part (2) formalizes the sense in which p-ESS
measures effective sample size: it always takes on values between 1 and $N$, and
it takes on the value $k \in [N]$ when $k$ particles have weight $1/k$. Part (3) shows
that the larger $p$, the more stringent the notion of effective sample size p-ESS is.
In other words, for $p < q$, q-ESS is a strictly stronger notion of ESS than p-ESS.
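The p-ESS family is simple to compute. The following sketch implements it for a plain list of nonnegative weights, using the entropic form for $p = 1$ and the ratio $\|W\|_1 / \max_n W^n$ for $p = \infty$. The function name `p_ess` and the example weights are assumptions made for illustration.

```python
import math

def p_ess(weights, p):
    """p-ESS of nonnegative, not-all-zero weights for p in [1, inf]."""
    total = sum(weights)
    if p == 1:
        # Entropic ESS: exponential of the Shannon entropy of the normalized weights.
        entropy = -sum((w / total) * math.log(w / total) for w in weights if w > 0)
        return math.exp(entropy)
    if math.isinf(p):
        # Infinity-ESS: L1 norm divided by the sup norm.
        return total / max(weights)
    p_norm = sum(w ** p for w in weights) ** (1.0 / p)
    return (total / p_norm) ** (p / (p - 1.0))

weights = [0.5, 0.25, 0.125, 0.125]   # hypothetical example weights
for p in (1, 2, 4, math.inf):
    print(p, p_ess(weights, p))
```

On this example the printed values decrease as `p` increases, consistent with larger `p` giving a more stringent notion of effective sample size.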

In order to prove our convergence result for αSMC, we will require a lower bound
on the ∞-ESS.

Assumption 5.D. There exists some $0 < \zeta < 1$ such that for all $s \in [t]$, $\mathcal{E}^\infty_s \ge \zeta N$. ◁

At a technical level, Whiteley, Lee, and Heine [23] use the 2-ESS lower bound
guarantee to bound the $L_2$ norm of the weights in terms of their $L_1$ norm. Similarly,
we will use the ∞-ESS lower bound guarantee to bound the sup-norm of the weights
in terms of their $L_1$ norm. Specifically, under Assumption 5.D,

$$\zeta N \le \mathcal{E}^\infty_s = \frac{\|W_s\|_1}{\|W_s\|_\infty} = \frac{\|W_s\|_1}{\sup_n W^n_s}, \tag{107}$$

so for all $n \in [N]$ and $s \in [t]$, $W^n_s \le \frac{\|W_s\|_1}{\zeta N}$.
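One simple way to use the ∞-ESS in practice is as a resampling trigger. The sketch below is a hypothetical rule, not the adaptation mechanism analyzed in the paper. It resamples whenever the ∞-ESS falls below $\zeta N$ and otherwise checks the sup-norm bound of Eq. (107). The names `inf_ess` and `should_resample` and the example numbers are assumptions.

```python
def inf_ess(weights):
    """Infinity-ESS: sum of the weights divided by the largest weight."""
    return sum(weights) / max(weights)

def should_resample(weights, zeta):
    """Hypothetical trigger: resample when the infinity-ESS drops below zeta * N."""
    return inf_ess(weights) < zeta * len(weights)

weights = [0.4, 0.3, 0.2, 0.1]   # illustrative unnormalized weights
zeta = 0.5

if not should_resample(weights, zeta):
    # When the threshold holds, every weight is at most ||W||_1 / (zeta * N).
    cap = sum(weights) / (zeta * len(weights))
    assert all(w <= cap for w in weights)
```

The final assertion cannot fail when the trigger is off, since it is the inequality of Eq. (107) rearranged.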


The expectation $E^{\alpha,N}_{y_{1,t}}[\hat Z_t]$ can be deconstructed in a similar manner to $E^{R,N}_{y_{1,t}}[\hat Z_t]$.


Recall that for r £ 7r, s , 


(r) = H G r, 


i= 1 


(108) 
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Proposition 5.4. If Assumptions 5.C and \57D\ hold, then for all t > 1, i > 1, 
N >i, t G E t , 

E y[ N t [Zt] (109) 

t +i 


- NiCNy- 1 ^ ^ 


t+l-l 


l— 1 TgTi.t+l 
In particular, in the case of i = 1, we have 

t+i 


_ -\^( T i>i) i 

m =1 j=l 


C N 


E %”[Zt\< 


1 


N((N) 


w EE (civ) 


t+l-£ 


£=1 r67i,i+i 

See Appendix A.2 for a proof.


j\r - i\ 1 ( Tl>1 ) 

~w 


Go, Tl Cp (t) (110) 


5.1. Quantitative Bounds. Proposition 5.4 leads to quantitative bounds very
similar to those for SIR. Indeed, the SIR results are essentially a special case of the
αSMC results, though with slightly worse constants in the αSMC case.

Theorem 5.5. If Assumptions \5X\ and \5.D\ hold, then for t > 1, 

Z t 1 2s=i Go, s ^i,t(G s ,t+i) — 1 


KL(7r M ||7r“f)< 
Proof. We have 


t +i 


jvkjvFiS £ <f w > 


(N 


t+l-l 


0(W 


— 2 \ 


(HI) 


= 1 TG7f,t+i 


N - i\ 1 ( Tl>1 ) 


Zt(N ~ 1) G l,t+1 1 Wn r y 0 / AT— 2 \ 

— n — + ^r + (N^ Go ’ sGs ' t+i+ ° { ] 


<Z t + Y,UlGo,eG!t, t+1 z t + 0(Ar _ 2) _ 


G 0 , Ti C«(t) (112) 
(113) 


(N 

The result then follows from Proposition 5.1 


(114) 

□ 

It is now possible to give generalized versions of Propositions 4.11 and 4.13 and
their corollaries:

Proposition 5.6. Fix any t> 1 and assume that for s G [t], g s = sup xeE g s (x) < 
oo. Then if Assumptions 5. C and 


Ej£[Z t ]<Z t ll + Z-'Hg 

{ S = 1 

and therefore, for N > £ _1 , 

KL(n 1 >t \\^f)<logll + Z^f[g s 
l S=1 


5.D hold, for all, i > 1, N > i, y\ t ,..., y\ t G E t , 

t r / i N 4 


i + 


(N 


- 1 


1+ Civ' 1 


<z^\[g 

s =1 

z^uUg 


< 


1 + 1 


(N 


(115) 

(116) 

(117) 

(118) 
Proof. We have


t+i 


K£&]<Y1 E (CN)- e+1 Go, ri C y e (r) 

1=1 T€7i, t +1 

t £+1 


S —1 £=2 

£ £ 


s^+iuei,!,)(<*> 


£- 1 


\-t+i 


- + II 9a E ( / ) (^) 


s=l f=l 
t 




= Z > + Ib 


1 + 


C N 


- 1 


(119) 

( 120 ) 

( 121 ) 

( 122 ) 

□ 


Proposition 5.7. If Assumptions\f.A\ \5. C\ an if 5.1) hold, then for all t > 1, i > 1, 
N>i, y{y\ t G E t , 


E v lt lZt\<Z t [l+-^ 


(123) 


and therefore 


KL^i.tlKf) ^ n °S ( X + ^ ( 124 ) 

Proof. The proof mirrors that for [2, Proposition 14]. Observe that for s G [t + 1], 
G 0 ,t +1 = G 0 ,t+i g n t+1 = Go iS T) s (G s ,t+i), so we can write for I G [t], r G 7e,t+i, 


— Go,t+l — Go.r,, Vri (G Tit T i+1 ). 


i= 1 


Combined with Assumption 4.A and writing G Sjt = sup^g^ G s ,t{x), 
t+i e-i 

E(c »)-‘ +1 E G °.nII G ?.,..„ 

^=1 reT^t+1 4=1 

£+1 ^ ^—1 EV 

< z , + 2,E«»)-' +1 E 

1=2 reTi t+ i °’ T1 *=1 


= z * eC ! 1 )(^)- £+i ^- 1 

^=1 ^ ' 


=1+ 


a 

Civ 


(125) 

(126) 

(127) 

(128) 

(129) 

□ 


6. Convergence in the Marginal and Predictive Cases 

So far we have focused on the convergence of $\bar\pi^{S,N}_{1,t}$, $\bar\pi^{R,N}_{1,t}$, and $\bar\pi^{\alpha,N}_{1,t}$, which
are all expectations of estimators of $\pi_{1,t}$. Very similar results can be derived for
the expected estimators of $\pi_t$, $\eta_{1,t}$, and $\eta_t$. The following result allows us to easily
generalize all the quantitative bounds given so far to these marginal and/or
predictive measures. Define the reverse probability kernels $\tilde\pi_{1,t-1}(x_t, dx_{1,t-1})$ and
$\tilde\eta_{1,t-1}(x_t, dx_{1,t-1})$ such that

$$\pi_{1,t}(dx_{1,t}) = \pi_t(dx_t)\,\tilde\pi_{1,t-1}(x_t, dx_{1,t-1}) \tag{130}$$

and

$$\eta_{1,t}(dx_{1,t}) = \eta_t(dx_t)\,\tilde\eta_{1,t-1}(x_t, dx_{1,t-1}). \tag{131}$$

Proposition 6.1. If Assumption 5.C holds, then the expected aSMC estimators 
satisfy 


d7T. 


-a,N 


d7r* 


f E? ,Ar 


/ 3/1, t 

A 




_j -ot,N 

\ _ nrce.IV 
Sl ' l) - 


d rf t 


'2 f r 

= E“ ,iV 

r^t-rl 

Ud 

3/1, t 

U t -iJ 


\,N 


dr] t 


-(*) = / KZ 


= / e: 


■,N 

yi,t 


Zi 
Z't 
Zt-1 


Zt -1 




dj/i,f_i). 


Proof. For Eq. (132), by Proposition 5.1] we have 


n?’ N (dyt) = 


d7T- 


>.,N 


l,t 


d7T! , 


[ E“ lJV 


/ 3/1,t 



7Tl,t(dt/l,t) 


= / KZ 


Zt 


7fi,t-i(2/t,dj/i,t_i)7r t (dy t ). 


For Eq. (133), consider the density 


Then 


)w( 


ai,t-i, /l,t) En=l 


nu i s a{‘irN-^ yZ=x < 

(4 1 ) n.L 2 «fi / r 1 1 ) M « of-i 1 . x i s > w {* 

viA x i,t) n := 2 *>!• nU isa^r 1 *- 1 Eti < 

M i (4 1 ) 11.1=2 g^i Ofid 1 )m s (zfr/, ) 

ELi < ni =2 

^ i 


d T.I a \nA 


where 


^(®i,t,ai,i-i) nuv 

^(*i,t, a M _i) = E a,JV [Z( | X M = ® M , A 1>t _r = a lit _r]. 


(132) 

(133) 

(134) 


(135) 

(136) 

(137) 


(138) 

(139) 

(140) 

(141) 

(142) 
We thus have
Vif (dyi.t) 

- £ J 

= E 


E JUf S -"' (d#1 '‘ ) ) 

1 ’jl.tfal.t, °M-1) / l,t) En=l w t 

x (dyi.t) 


(143) 


(144) 


- E ] 
-{ £ 






a M-i> /( d J/i,t) f (145) 


V’ a (^j*i,i) a i,t-i)<5y lit (da;i( t ) Wi it (dyi it ), (146) 


proving the first equality. The second equality follows since the estimator of the 
normalization constant can be written as 


% = ^ _1 E w ? = N ~" E E 

n n k 

= N~ 1 J2wt 1 gl 1 = Z t _ 1 , 


(147) 

(148) 


where the penultimate equality follows from Assumption 5.C.
Eq. (134) follows by the same argument used to prove Eq. 
Eq. (133). 


but using 
□ 


For $\pi_t$, we have bounds identical to those for $\pi_{1,t}$:

Theorem 6.2. For the SIS and SIR algorithms, 

KL(7r t ||7f t S,JV ) < log (l + ^ ^ (149) 

and 

KL(7rt||7f t RlAr )<l 0 gGt- (150) 


Proof. For SIS, 


d7r' 


S,N 


d7 Tt 


-(yi, t ) = / e S iJV 


N 


ZtiVsi,*(*£*) 

N 


> 


Kt = yu 


N - l + fZ t 1 g li t(yi,t)7r 1 ,t-i(yt,dy 1 ,t-i) 


ni,t-i(yt,dyi,t-i) 

(151) 

(152) 
Hence,


KL(7T t | 


-S,N 
7 r. 


) < J log 

< log 
= log 
= log 


N-l + fZ t 1 gi,t{yi,t)^i,t-i(yt, dyi,t-i) 


N 


N-i + fZt gi,t(yi,t)^i,t-i(yt,dyi,t-i) 


N 


7 Tt(d yt) 


77 * (d yt) 


1 + 


77l,t(^t 9l,t) ~ 1 


N 




For SIR, 


,-R.AT 

d7T t 


(yt) > 


Zt 


dn t jEy^[Z t ]jv 1>t -i(y t ,dy 1>t -i) 

and the bound follows analogously to the SIS case. 


(153) 


□ 


Using Proposition 5.1 and Theorem 6.2 together with the results from
Sections 4.2 and 5.1, the following quantitative bounds follow immediately:

Theorem 6.3. For all $t \ge 1$,

(1) if for $s \in [t]$, $\bar g_s < \infty$, then

$$KL(\pi_t \,\|\, \bar\pi^{R,N}_t) \le \frac{t\left(Z_t^{-1}\prod_{s=1}^t \bar g_s - 1\right)}{N}, \tag{154}$$

while if Assumptions 5.C and 5.D also hold, then for $N \ge \zeta^{-1}$,

$$KL(\pi_t \,\|\, \bar\pi^{\alpha,N}_t) \le \frac{Z_t^{-1}\,2^t \prod_{s=1}^t \bar g_s}{\zeta N}; \tag{155}$$

and

(2) if Assumption 4.A holds, then

$$KL(\pi_t \,\|\, \bar\pi^{R,N}_t) \le \frac{(\beta - 1)t}{N}, \tag{156}$$

while if Assumptions 5.C and 5.D also hold, then

$$KL(\pi_t \,\|\, \bar\pi^{\alpha,N}_t) \le \frac{\beta t}{\zeta N}. \tag{157}$$

Remark 6.1. From the operator perspective, $\pi^{S,N}_t$, $\pi^{R,N}_t$, and $\pi^{\alpha,N}_t$ generally
approximate $\pi_t$ far better than $\pi^{S,N}_{1,t}$, $\pi^{R,N}_{1,t}$, and $\pi^{\alpha,N}_{1,t}$ approximate $\pi_{1,t}$. It is quite
natural for SMC to produce better estimates of the marginal expectation since,
while both the marginal and joint estimators involve the same number of particles,
the joint expectation involves an integral over a much higher-dimensional space. So
it is somewhat surprising that the KL divergence bounds we obtain in the marginal
case are identical to those in the full joint distribution case already considered. But
in fact, intuition suggests that the KL divergence case will behave very differently
from that of functional approximation. Since only a single sample is being drawn
from $\bar\pi^{S,N}_{1,t}$, $\bar\pi^{R,N}_{1,t}$, or $\bar\pi^{\alpha,N}_{1,t}$, the quality of the full sample compared to the marginal
sample does not suffer from the same “curse of dimensionality.”
Using Proposition 5.1 together with the results of Sections 4.2 and 5.1, we
obtain similar results for estimators of the predictive measure $\eta_{1,t}$ to those for the
estimators of $\pi_{1,t}$.

Theorem 6.4. For all $t \ge 1$,

(1) if for $s \in [t-1]$, $\bar g_s < \infty$, then

$$KL(\eta_{1,t} \,\|\, \bar\eta^{R,N}_{1,t}) \le \frac{(t-1)\left(Z_{t-1}^{-1}\prod_{s=1}^{t-1} \bar g_s - 1\right)}{N}, \tag{158}$$

while if Assumptions 5.C and 5.D also hold, then for $N \ge \zeta^{-1}$,

$$KL(\eta_{1,t} \,\|\, \bar\eta^{\alpha,N}_{1,t}) \le \frac{Z_{t-1}^{-1}\,2^{t-1} \prod_{s=1}^{t-1} \bar g_s}{\zeta N}; \tag{159}$$

and

(2) if Assumption 4.A holds, then

$$KL(\eta_{1,t} \,\|\, \bar\eta^{R,N}_{1,t}) \le \frac{(\beta - 1)(t - 1)}{N}, \tag{160}$$

while if Assumptions 5.C and 5.D also hold, then

$$KL(\eta_{1,t} \,\|\, \bar\eta^{\alpha,N}_{1,t}) \le \frac{\beta(t - 1)}{\zeta N}. \tag{161}$$


7. Particle MCMC 


We now turn to particle MCMC methods, specifically i-cSMC and particle Gibbs.
We will leverage results from Section 5 to prove mixing and convergence results for
adaptive versions of i-cSMC and particle Gibbs under the condition that a lower
bound on the ∞-ESS is maintained by the algorithm. Briefly, let us recall the
setting and some key definitions from Section 2.5. We parameterize the Markov
chain $(X_t)_{t \ge 1}$ and the potentials by a global parameter $\theta \in \Theta$ for which there is
a prior distribution $\varpi(d\theta)$. Hence, $M_s$ is replaced by $M^\theta_s$ and $g_s$ is replaced by
$g^\theta_s$. Throughout this section we fix $t$, and let $(Y, \mathcal{Y})$ be the measurable space with
$Y = E^t$ and $\mathcal{Y} = \mathcal{B}(Y)$. Since $t$ is fixed we will suppress most of the time notation.
The target distribution is

$$\pi(d\theta \times dy) = \gamma(d\theta \times dy)/Z, \tag{162}$$

where

$$\gamma(d\theta \times dy) = \prod_{s=1}^t \left\{ g^\theta_s(y_s)\,M^\theta_s(y_{s-1}, dy_s) \right\} \varpi(d\theta) \quad \text{and} \quad Z = \gamma(1). \tag{163}$$

We denote the conditional distributions given $\theta$ or $y$ by, respectively, $\pi_\theta(dy)$ or
$\pi_y(d\theta)$.

The particle Gibbs (PG) sampler uses the MCMC kernel

$$\Pi^G(\theta, y, d\vartheta \times dz) = \pi_y(d\vartheta)\,\Pi_\vartheta(y, dz). \tag{164}$$


First, in Section 7.1, we discuss the standard PG algorithm, where $\Pi_\theta = P^{R,N}_\theta$,
the i-cSMC kernel. In particular, we summarize results from [2] and describe the
close technical connections between their techniques and those from Section 4. In
Section 7.2, we use results from Section 5 to prove convergence and mixing results
for iterated conditional αSMC and αPG, where $\Pi_\theta = P^{\alpha,N}_\theta$, the i-cαSMC kernel.
7.1. Convergence of PG and i-cSMC. Recall from Section 2.5 that the i-cSMC
kernel is given by


7~)R., N ( 1 \ A -TT7iR.,A' 

P g (y,dz) = E ig 


x: 


(dz) 


(165) 


We will first state some properties of i-cSMC processes, then the particle Gibbs 
sampler. For notational convenience, we will write #[•] = for 

expectations with respect to the law of the c 2 SMC process with trajectories y 1 = y 
and y 2 = z , and lineages k\ t = 1 and k 2 lt = k. 

Proposition 7.1 ([2, Proposition 6]). For $y \in Y$ and $N \ge 2$,

$$P^{R,N}_\theta(y, dz) \ge \frac{Z\,(1 - 1/N)^t}{E^{R,N}_{y,z,\theta}[\hat Z]}\,\pi_\theta(dz). \tag{166}$$

From Proposition 7.1 it follows that if $\sup_{y,z \in Y} E^{R,N}_{y,z,\theta}[\hat Z] \le B_N$, the cSMC
kernel satisfies the minorization condition

$$P^{R,N}_\theta(y, dz) \ge \epsilon_{t,N}\,\pi_\theta(dz), \tag{167}$$

where $\epsilon_{t,N} = Z B_N^{-1} (1 - 1/N)^t$. The minorization condition, in turn, implies uniform
ergodicity and a number of other types of convergence guarantees for the i-cSMC
process. For a stationary Markov chain $(\xi_k)_{k \ge 0}$ with $\mu$-reversible Markov kernel $K$,
the asymptotic variance for the function $\phi \in L^2(S, \mu)$ is defined to be

$$V[\phi, K] = \lim_{k \to \infty} \frac{1}{k}\,V\left[\sum_{i=1}^k \phi(\xi_i)\right]. \tag{168}$$


Proposition 7.2 ([2, Proposition 7]). Let $\mu$ be a probability measure on $(S, \mathcal{S})$ and
let $\Pi : S \times \mathcal{S} \to [0, 1]$ be a probability kernel that is reversible with respect to $\mu$. If
the stationary Markov chain defined by $\Pi$ is $\psi$-irreducible and aperiodic and there
exists $\epsilon > 0$ such that for all $y \in S$, $\Pi(y, dz) \ge \epsilon\,\mu(dz)$, then

(1) for any probability measure $\nu \ll \mu$ and $k \ge 1$,

$$d_{\chi^2}(\nu\Pi^k, \mu) \le d_{\chi^2}(\nu, \mu)(1 - \epsilon)^k, \tag{169}$$

(2) for any $y \in S$ and $k \ge 1$,

$$d_{TV}(\delta_y \Pi^k, \mu) \le (1 - \epsilon)^k, \tag{170}$$

and

(3) for any $\phi \in L^2(S, \mu)$,

$$V[\phi, \Pi] \le (2\epsilon^{-1} - 1)\,V_\mu[\phi]. \tag{171}$$


Corollary 7.3. If $\sup g^\theta_s < \infty$ for all $s \in [t]$, then all the results of Proposition 7.2
apply with $\Pi = P^{R,N}_\theta$ and $\epsilon = \epsilon_{t,N}$, where $1 - \epsilon_{t,N} = O(1/N)$. If in addition
Assumption 4.A holds, then for every $C > 0$ there exists an $\epsilon_C > 0$ such that for
all $t \ge 1$, if $N = Ct$, then $\epsilon_{t,N} \ge \epsilon_C > 0$.

Proof. The first part follows from Propositions 7.1, 7.2, and 4.11. The second then
follows from Proposition 4.13. □

Remark 7.1. The second part of the corollary states that, if $\sup g^\theta_s < \infty$ and
Assumption 4.A hold, then scaling $N$ linearly with $t$ ensures a uniform convergence
rate, as measured by $\chi^2$-divergence, total variation distance, or asymptotic variance.
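The practical meaning of a minorization constant can be read off from parts (2) and (3) of Proposition 7.2. The sketch below turns a given constant `eps` into a number of MCMC iterations and a variance inflation factor. The function names and the numerical values are illustrative assumptions, and `eps` must be supplied by the user, for example from a bound on $\epsilon_{t,N}$.

```python
import math

def tv_bound(eps, k):
    """Upper bound (1 - eps)^k on the total variation distance after k steps."""
    return (1.0 - eps) ** k

def steps_for_tv(eps, tol):
    """An integer k with (1 - eps)^k <= tol, assuming 0 < eps < 1 and 0 < tol < 1."""
    return math.ceil(math.log(tol) / math.log1p(-eps))

def variance_inflation(eps):
    """Factor 2/eps - 1 relating the asymptotic variance to the i.i.d. variance."""
    return 2.0 / eps - 1.0

eps, tol = 0.3, 1e-3   # hypothetical minorization constant and tolerance
k = steps_for_tv(eps, tol)
print(k, tv_bound(eps, k), variance_inflation(eps))
```

A constant close to 1 gives a variance inflation factor close to 1, which corresponds to nearly independent draws.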
Recall that $\gamma_\theta(dy) = \prod_{s=1}^t g^\theta_s(y_s)\,M^\theta_s(y_{s-1}, dy_s)$. The results of [2, Section 7]
show how to use the convergence guarantees of the i-cSMC kernel to give conditions
for the geometric ergodicity of the PG sampler:

Theorem 7.4. If

$$\pi\text{-}\operatorname{ess\,sup}_{\theta} \frac{\prod_{s=1}^t \bar g_{\theta,s}}{\gamma_\theta(1)} < \infty \tag{172}$$

or there exists $1 < \beta < \infty$ such that for any $t, s \in \mathbb{N}$,

$$\pi\text{-}\operatorname{ess\,sup}_{\theta, x} \frac{G_{0,t}\,G_{t,t+s}(x)}{G_{0,t+s}} \le \beta, \tag{173}$$

then as soon as the Gibbs sampler is geometrically ergodic, the PG chain is
geometrically ergodic.

7.2. Convergence of αPG and i-cαSMC. We now generalize the results of
the previous section to allow for adaptive resampling. Of particular note is that
our results show that ∞-ESS is a measure of effective sample size for adaptive
resampling in this setting. Recall from Section 2.5 that the i-cαSMC kernel can be
used to define the αPG algorithm. For $y \in Y$, the i-cαSMC kernel is given by


PZ’ N {y,&z)±K:Ze 


J x: 


■ (dz) 


(174) 


We call the family of Markov chains with transition kernels of the form $P^{\alpha,N}_\theta(y, dz)$
i-cαSMC processes.

Our primary technical goal will be to prove a minorization condition for $P^{\alpha,N}_\theta(y, dz)$
by generalizing Proposition 7.1. We begin by noting that, under Assumption 5.C,
the normalization constants for the c²αSMC process are given by

N 


and, for s = 2,..., t, 


Ci = 


N 


N — 1 




kf y s _ 1 

1 a s -1 


k —1 


Let 


N 

kjv — max Y' 


a kn a kn’ 


and 


&n — &N V 


(175) 


(176) 


(177) 


for s = 2,..., t. Thus, for all s £ [i], C s < 1 _ 1 K ' • Here a V & denotes 


so C s < 

the maximum of a,b £ R. 

Proposition 7.5. For $y \in Y$ and $N \ge 2$,

$$P^{\alpha,N}_\theta(y, dz) \ge \frac{Z_t}{E^{\alpha,N}_{y,z,\theta}[\hat Z_t]\,\prod_{s=1}^t C_s}\,\pi_\theta(dz) \ge \frac{Z_t\,(1 - \kappa'_N)^t}{E^{\alpha,N}_{y,z,\theta}[\hat Z_t]}\,\pi_\theta(dz). \tag{178}$$

The proof of Proposition 7.5 can be found in Appendix A.3. We recover
Proposition 7.1 as a special case of Proposition 7.5 since, for SIR, $\kappa'_N = 1/N$.

Using Proposition 5.4, we obtain:
Proposition 7.6. If Assumptions 5.C and 5.D hold, then for all $N \ge 2$,

£+1 / J\J q \ 1 (' 7 " 1 > 1 ) 

]& a ’ N 


VS ' €= 1 tGT m+ i V S 7 


where 


C( +z {t) 4 [] (G TilT4+1 (y Ti ) + G ri , Ti+1 (^)) . 


(180) 


2=1 


Hence, if Assumptions \5. C\ and 5.D hold and the potentials are bounded, then 


E y’^g[Zt] = 1 + 0(N x ), while if Assumptions f.A 


■“v,z,e 


5. C and 


5.D 


hold, then 


e ;;»[z<] < ( 1 +(I 


(181) 


Corollary 7.7. If Assumptions \4-A\ \ 5. C\ and \5.D\ hold, then for all y € Y, 

Pg’ N (y,dz) > e t<N ^{dz), (182) 


where 


£t,N = 


a (!- ivK 1 -*^) 


t~ 1 


(1 _L ?4L\t 


(183) 


CAO 


Furthermore, if N > /or some constant C > 0, where n' N = N 

then 


e t ,N > exp () . 


1 


(184) 


In particular, assuming k' n < B/N for some constant B > 0, if N > Ct + B, then 


et,N > exp ( - ^ - B ) . 


(185) 


Proof. The first part follows from Propositions 7.5 and 7.6, For the second part, 
we then have 


Ct.AT > 


i+jr 

1 — k> N . 


= 1 + 


> 1 


£ Ct) 


> exp — 


1 (W_ 

1 - k 'n \C n 

1 


k n 


(C 


The final part follows after noting that if k' n < B/N , then 


1-«jv 


l§+*vi> 


1 - B/N \(N 


2/3 


-f- + B/N = 


1 


N-B V C 


2 -P + b 


(186) 

(187) 

(188) 

□ 


Remark 7.2. In the case of SIR, Corollary 7.7 is almost as good as [2, Corollary
15]: the former result replaces $\beta - 1$ with $\beta$.


The following, which generalizes Corollary 7.3 and Theorem 7.4, is a straightforward
consequence of Propositions 7.2 and 5.6 and Corollary 7.7.

Theorem 7.8. If Assumptions 5.C and 5.D hold, then for $N \ge 2$, the i-cαSMC
process with kernel $P = P^{\alpha,N}_\theta$

(1) is reversible with respect to $\pi_\theta$ and defines a positive operator,

(2) if the potentials are bounded there exists $\epsilon_{t,N} = 1 + O(1/N)$ such that

(a) for all $y \in Y$, $P(y, dz) \ge \epsilon_{t,N}\,\pi_\theta(dz)$,

(b) for any measure $\nu \ll \pi_\theta$ and $k \ge 1$,

$$d_{\chi^2}(\nu P^k, \pi_\theta) \le d_{\chi^2}(\nu, \pi_\theta)(1 - \epsilon_{t,N})^k, \tag{189}$$

(c) for any $y \in Y$ and $k \ge 1$,

$$d_{TV}(\delta_y P^k, \pi_\theta) \le (1 - \epsilon_{t,N})^k, \tag{190}$$

(d) for any $\phi \in L^2(Y, \pi_\theta)$,

$$V[\phi, P] \le (2\epsilon_{t,N}^{-1} - 1)\,V_{\pi_\theta}[\phi], \tag{191}$$

(3) if in addition Assumption 4.A holds and there is a constant $B > 0$ such that
$\kappa'_N \le B/N$, then for every $C > 0$ there exists an $\epsilon_{B,C,\zeta} > 0$ such that with
$N \ge Ct + B$, for any $t \ge 1$,

$$\epsilon_{t,N} \ge \epsilon_{B,C,\zeta} > 0. \tag{192}$$

Furthermore, if

$$\pi\text{-}\operatorname{ess\,sup}_{\theta} \frac{\prod_{s=1}^t \bar g_{\theta,s}}{\gamma_\theta(1)} < \infty \tag{193}$$

or there exists $1 < \beta < \infty$ such that for any $t, s \in \mathbb{N}$,

$$\pi\text{-}\operatorname{ess\,sup}_{\theta, x} \frac{G_{0,t}\,G_{t,t+s}(x)}{G_{0,t+s}} \le \beta, \tag{194}$$

then, when Assumptions 5.C and 5.D hold and the Gibbs sampler is geometrically
ergodic, the αPG chain is geometrically ergodic.


Notably, our results show that ∞-ESS is a notion of effective sample size in the
setting of the i-cαSMC and αPG algorithms.

Remark 7.3. At a high level, we have seen that the expected value of $\hat Z_t$ arises in the
study of the mixing properties of iterated conditional SMC (i-cSMC) Markov chains,
which are related to the convergence of particle Gibbs (PG) samplers. In order to
show geometric ergodicity for particle Gibbs samplers, bounds on the expected
value of $\hat Z_t$ can be used, with growth of the expectation as $t$ increases determining
how well the particle Gibbs algorithm scales. Bounds on the expected value of $\hat Z_t$
also allow one to obtain bounds on $KL(\pi_{1,t} \,\|\, \bar\pi^{R,N}_{1,t})$.² We have also shown that an
analogous connection exists for the expected value of $\hat Z_t$ between adaptive particle
Gibbs and adaptive SMC for sampling. Hence, as a slogan, good performance of
(adaptive) particle Gibbs is equivalent to good performance of (adaptive) SMC for
sampling.


²In the i-cSMC setting, the expectation is with respect to the “doubly conditional SMC kernel,”
whereas we require expectations with respect to the “conditional SMC kernel.” However, this is
only a small technical difference and, as we have seen, the same techniques apply in both cases.
Appendix A. Additional Proofs


A.1. Proof of Proposition 5.3. For (1), the fact that $\mathcal{E}^1_t = \mathcal{E}^{ent}_t$ is a straightforward
algebraic manipulation. To prove the limit equality, write $v = (v_1, \dots, v_N)$
and observe that, using the Taylor series for $x^p$ and $\log(1 + x)$, we have
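\begin{align}
& \lim_{p \to 1} \left( \frac{\|v\|_1^p}{\|v\|_p^p} \right)^{1/(p-1)} \tag{195} \\
&= \lim_{p \to 1} \left( \frac{\sum_{k=0}^{\infty} \|v\|_1 (p-1)^k \log^k(\|v\|_1)/k!}{\sum_{n=1}^{N} \sum_{k=0}^{\infty} v_n (p-1)^k \log^k(v_n)/k!} \right)^{1/(p-1)} \tag{196} \\
&= \lim_{x \to \infty} \left( \frac{\sum_{k=0}^{\infty} \|v\|_1 x^{-k} \log^k(\|v\|_1)/k!}{\sum_{n=1}^{N} \sum_{k=0}^{\infty} v_n x^{-k} \log^k(v_n)/k!} \right)^{x} \tag{197} \\
&= \lim_{x \to \infty} \frac{\exp\left( x \log\left( 1 + \sum_{k=1}^{\infty} x^{-k} \log^k(\|v\|_1)/k! \right) \right)}{\exp\left( x \log\left( 1 + \sum_{k=1}^{\infty} \sum_{n=1}^{N} v_n \|v\|_1^{-1} x^{-k} \log^k(v_n)/k! \right) \right)} \tag{198} \\
&= \lim_{x \to \infty} \frac{\exp\left( x \sum_{m=1}^{\infty} \frac{(-1)^{m+1}}{m} \left[ \sum_{k=1}^{\infty} x^{-k} \log^k(\|v\|_1)/k! \right]^m \right)}{\exp\left( x \sum_{m=1}^{\infty} \frac{(-1)^{m+1}}{m} \left[ \sum_{k=1}^{\infty} \sum_{n=1}^{N} v_n \|v\|_1^{-1} x^{-k} \log^k(v_n)/k! \right]^m \right)} \tag{199} \\
&= \lim_{x \to \infty} \frac{\exp\left( \log(\|v\|_1) + O(x^{-1}) \right)}{\exp\left( \|v\|_1^{-1} \log\left( \prod_{n=1}^{N} v_n^{v_n} \right) + O(x^{-1}) \right)} \tag{200} \\
&= \frac{\|v\|_1}{\prod_{n=1}^{N} v_n^{v_n/\|v\|_1}}, \tag{201}
\end{align}
where the third line substitutes $x = 1/(p-1)$. 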
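As a numerical sanity check on the limit above, the following Python sketch compares the p-ESS with the entropy-based expression as p approaches 1. It assumes the p-ESS has the form (‖v‖_1/‖v‖_p)^{p_*} with p_* = p/(p − 1), as suggested by the proof of part (2); the function names `p_ess` and `entropy_ess` and the test vector are illustrative choices, not notation from the paper.

```python
import math


def p_ess(v, p):
    """Assumed p-ESS for p > 1: (||v||_1 / ||v||_p) ** p_*, p_* = p / (p - 1)."""
    norm1 = sum(v)
    normp = sum(x ** p for x in v) ** (1.0 / p)
    return (norm1 / normp) ** (p / (p - 1.0))


def entropy_ess(v):
    """Candidate p -> 1 limit: ||v||_1 / prod_n v_n ** (v_n / ||v||_1)."""
    norm1 = sum(v)
    # log of prod_n v_n ** (v_n / ||v||_1); requires strictly positive entries
    log_prod = sum(x * math.log(x) for x in v) / norm1
    return norm1 / math.exp(log_prod)


# Illustrative unnormalized weights (hypothetical values).
v = [0.5, 1.0, 2.0, 4.0]
limit = entropy_ess(v)
# Absolute gap between the p-ESS and the entropy-based expression.
errors = [abs(p_ess(v, p) - limit) for p in (1.1, 1.01, 1.001)]
print(limit, errors)
```

The gaps should shrink roughly linearly in p − 1, which is consistent with the O(x^{-1}) remainder terms in the derivation.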


To prove the remaining parts, we make repeated use of the following: 

Fact. For $1 \le r < s \le \infty$ and any vector $v \in \mathbb{R}^N_+$, 
$\|v\|_s \le \|v\|_r \le N^{1/r - 1/s} \|v\|_s$, with the lower (upper) bound 
achieved if and only if $v$ has one non-zero entry ($v$ has all equal entries). 

For (2), apply the Fact with $r = 1$, $s = p > 1$, and note that in this case 
$1/r - 1/s = 1 - 1/p = 1/p_*$. We then have $1 \le \|v\|_1/\|v\|_p \le N^{1/p_*}$, 
proving the result for $p > 1$. For $p = 1$, the result follows from part (1) and 
elementary properties of the entropy. 
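The Fact and the resulting bounds in part (2) are easy to check numerically. The sketch below is illustrative only: it assumes the p-ESS is (‖v‖_1/‖v‖_p)^{p_*}, and the helper names `norm`, `fact_holds`, and `p_ess` are hypothetical.

```python
import math
import random


def norm(v, r):
    """l_r norm of a nonnegative vector; r = math.inf gives the max norm."""
    if math.isinf(r):
        return max(v)
    return sum(x ** r for x in v) ** (1.0 / r)


def fact_holds(v, r, s, tol=1e-9):
    """Check ||v||_s <= ||v||_r <= N ** (1/r - 1/s) * ||v||_s for r < s."""
    n = len(v)
    inv_s = 0.0 if math.isinf(s) else 1.0 / s
    lower, mid = norm(v, s), norm(v, r)
    upper = n ** (1.0 / r - inv_s) * lower
    return lower <= mid * (1 + tol) and mid <= upper * (1 + tol)


def p_ess(v, p):
    """Assumed p-ESS for p > 1: (||v||_1 / ||v||_p) ** p_*."""
    p_star = 1.0 if math.isinf(p) else p / (p - 1.0)
    return (norm(v, 1) / norm(v, p)) ** p_star


rng = random.Random(0)
v = [rng.random() + 1e-3 for _ in range(8)]
checks = [fact_holds(v, r, s) for r, s in [(1, 2), (1, math.inf), (2, 5)]]
one_hot = [0.0, 3.0, 0.0, 0.0]  # single non-zero entry: lower bound case
uniform = [2.0, 2.0, 2.0, 2.0]  # all entries equal: upper bound case
print(checks, p_ess(one_hot, 2), p_ess(uniform, 2))
```

The two extreme vectors correspond to the equality cases named in the Fact: a single non-zero entry gives the smallest possible value, and equal entries give the largest.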

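For (3), in the case that $p > 1$, note that 
\begin{align*}
\|v\|_1^{q_* - p_*} &\ge N^{(q_* - p_*)/q_*} \|v\|_q^{q_* - p_*} = N^{1 - p_*/q_*} \|v\|_q^{q_* - p_*} \tag{202} \\
&= N^{-p_*(1/p - 1/q)} \|v\|_q^{q_* - p_*},
\end{align*}
where the final equality follows since 
\[
1 - p_*/q_* = 1 - p_*(1 - 1/q) = 1 - p_* + p_*/q = -p_*/p + p_*/q.
\]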
We conclude that 


Mi 
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I9.-P. 
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( 202 ) 

(203) 

(204) 

(205) 
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(206) 
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> 


Mil 

ML 




Mi 


Mi 


> [ iMl 1 


(207) 

(208) 

(209) 

where the first, third, and fourth inequalities follow from the Fact and the second 
follows from Eq. (202). 

The case of p = 1 follows from the p > 1 case and part (1). 
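The exponent identity used in part (3), and the inequality obtained by raising the Fact's upper bound (with r = 1, s = q) to the negative power q_* − p_*, can be checked numerically. This is a minimal sketch assuming 1 < p < q < ∞ and the usual conjugate exponent p_* = p/(p − 1); the helper names `conj`, `exponent_gap`, and `power_inequality` are hypothetical.

```python
def conj(p):
    """Conjugate exponent p_* with 1/p + 1/p_* = 1 (assumes 1 < p < inf)."""
    return p / (p - 1.0)


def norm(v, r):
    """l_r norm of a nonnegative vector for finite r >= 1."""
    return sum(x ** r for x in v) ** (1.0 / r)


def exponent_gap(p, q):
    """(1 - p_*/q_*) minus (-p_* (1/p - 1/q)); should vanish up to rounding."""
    lhs = 1.0 - conj(p) / conj(q)
    rhs = -conj(p) * (1.0 / p - 1.0 / q)
    return lhs - rhs


def power_inequality(v, p, q, tol=1e-9):
    """Check ||v||_1^(q_*-p_*) >= N^(-p_*(1/p-1/q)) * ||v||_q^(q_*-p_*)."""
    n = len(v)
    e = conj(q) - conj(p)  # negative when p < q, so the Fact's bound flips
    lhs = norm(v, 1) ** e
    rhs = n ** (-conj(p) * (1.0 / p - 1.0 / q)) * norm(v, q) ** e
    return lhs >= rhs * (1 - tol)


gap = exponent_gap(2.0, 4.0)
ok_generic = power_inequality([0.2, 1.5, 3.0], 2.0, 4.0)
ok_uniform = power_inequality([1.0, 1.0, 1.0, 1.0], 2.0, 4.0)  # equality case
print(gap, ok_generic, ok_uniform)
```

The uniform vector is included because it attains equality in the Fact, so the check there only passes up to the floating-point tolerance.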

A.2. Proof of Proposition 5.4. We prove the result for $i = 1$. The general case 
follows from straightforward modifications. We will use the abbreviated notation 
$G^k_{s,t} = G_{s,t}(X^k_s)$ or $G_{s,t}(x^k_s)$, $G^y_{s,t} = G_{s,t}(y_s)$, 
$g^k_s = g_s(X^k_s)$ or $g_s(x^k_s)$, and $g^y_s = g_s(y_s)$. The variables are 
$X_s$ inside expectations and $x_s$ outside expectations. Throughout the proof, 
when limits of a sum are not specified, the sum is from 1 to $N$. 

The proof relies on the following lemma. 


Lemma A.1. If $y_{1,t} \in E^t$, then 

(1) for s = 2,..., t and any functions (f> a :E 


e[iV], 




Ecrara 


= E a s-i<t> f /(ys) + EEE <-1 

fs Is n^f s k 


^.nk nn k 

w -i“ ! - iVi qLw ); 


( 210 ) 


W" 


(2) for $\tau \in [t - s]$, 


E a ’ N 

3/1,t 


Y, w s°i 


s+r 


Ira 


i 


( 211 ) 


< ^ G ls + r + 


and 

(3) for s = l, 


where, 


,t- 1, 




Z t I 


< At- S + St-s 


( 212 ) 
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( 214 ) 
Proof. For (1), 
For (2), choosing $\phi^n_s(x) = w^n_s G_{s,s+\tau}(x)$, we have 
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(218) 

(219) 

( 220 ) 
( 221 ) 


where the inequality follows from Assumption 5.D and we have repeatedly used 
Assumption 5.C. 

To show (3), we start by using (2) with $s = t$ and $\tau = 1$: 


ga.AT 
V i,t 


E^w \Ft~i 


= 7^ E w t-i9t-i9 V t + 7^ E iGT-i.t+1 (222) 

^ fc ^ m 


(N 


— At -i + St- 


ii 


(223) 


Hence, (3) holds for $s = 1$. We now assume that the bound holds for some 
$s \in \{1, \dots, t - 2\}$ and establish that it also holds for $s + 1$. Using the 
inductive hypothesis, 
Using (2), we have 
Summing the parenthesized double sums of the first two terms yields 

E E E m s - e Gy_ s>Tl cy(r) 

£=1 t€_Tl,s i=l ' r £7l >s 
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(234) 


so the first two terms are equal to $A_{t-(s+1)}$. Summing the parenthesized double 
sums of the last two terms yields 
Using part (3) of Lemma A.1 with $s = t - 1$, we have 
Hence, arguments analogous to those in the proof of Lemma A.1, together with the 
fact that $G_{0,1} = 1$, yield 


[A.+B,] 
A.3. Proof of Proposition 7.5. First, observe that we can write the i-caSMC 
kernel as 
where 
Next note that 
For the remainder of the proof, to keep notation compact when writing laws, instead 
of writing, e.g., $X_s \in \mathrm{d}x_s$ or $F^n_s = f^n_s$, whenever a random 
variable is instantiated to be (the differential of) the lowercase version of itself, 
we will write only the random variable: for example, $X_s$ or $F^n_s$. Now, for 
$s = 2, \dots, t$, 
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x P“;^[X S , A s _!, Fj',F* = k s | A 1)8 _ 2) i?_ lt f;_! = fc s _l]. 


Using Eqs. (248) and (250), we have (note that the terms such as those involving 
do should be ignored) 
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Ni(z ht eS) 



fce[JV]* 


(nl=i 



(253) 


-L . 

from which the result follows. 